[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152006#comment-17152006 ]
Robert Stupp commented on CASSANDRA-15922:
------------------------------------------

+1 (assuming CI looks good and the 3.11 and 3.0 back-ports are clean)

> High CAS failures in NativeAllocator.Region.allocate(..)
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-15922
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15922
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Memtable
>            Reporter: Michael Semb Wever
>            Assignee: Michael Semb Wever
>            Priority: Normal
>             Fix For: 4.0, 3.0.x, 3.11.x
>
>         Attachments: NativeAllocatorRegion2Test.java, NativeAllocatorRegionTest.java, Screen Shot 2020-07-05 at 13.16.10.png, Screen Shot 2020-07-05 at 13.26.17.png, Screen Shot 2020-07-05 at 13.35.55.png, Screen Shot 2020-07-05 at 13.37.01.png, Screen Shot 2020-07-05 at 13.48.16.png, Screen Shot 2020-07-06 at 11.35.35.png, Screen Shot 2020-07-06 at 11.36.44.png, Screen Shot 2020-07-06 at 13.26.10.png, profile_pbdpc23zafsrh_20200702.svg
>
>
> h4. Problem
> The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} for the current offset in the region. Allocations depend on a {{.compareAndSet(..)}} call.
> In highly contended environments CAS failures can be high, starving writes in a running Cassandra node.
> h4. Example
> Up to 33% of CPU time has been witnessed stuck in the {{NativeAllocator.Region.allocate(..)}} loop (due to CAS failures) during a heavy Spark analytics write load.
> These nodes have 40 CPU cores and 256GB RAM, with the relevant settings:
> - {{memtable_allocation_type: offheap_objects}}
> - {{memtable_offheap_space_in_mb: 5120}}
> - {{concurrent_writes: 160}}
> Numerous flamegraphs demonstrate the problem. See attached [^profile_pbdpc23zafsrh_20200702.svg].
> h4. Suggestion: ThreadLocal Regions
> One possible solution is to have separate Regions per thread.
> Code-wise this is relatively easy to do, for example replacing NativeAllocator:59
> {code}private final AtomicReference<Region> currentRegion = new AtomicReference<>();{code}
> with
> {code}private final ThreadLocal<AtomicReference<Region>> currentRegion = new ThreadLocal<>() {...};{code}
> But this approach substantially changes the allocation behaviour, with more than concurrent_writes number of Regions in use at any one time. For example, with {{concurrent_writes: 160}} that's 160+ regions, each of 1MB.
> h4. Suggestion: Simple Contention Management Algorithm (Constant Backoff)
> Another possible solution is to introduce a contention management algorithm that a) reduces CAS failures in high-contention environments, b) doesn't impact normal environments, and c) keeps the allocation strategy of using one region at a time.
> The research paper [arXiv:1305.5800|https://arxiv.org/abs/1305.5800] describes this CAS contention problem and demonstrates a number of algorithms to address it. The simplest of these is the Constant Backoff CAS Algorithm.
> Applying the Constant Backoff CAS Algorithm involves adding one line of code to {{NativeAllocator.Region.allocate(..)}} to sleep for one (or some constant number of) nanoseconds after a CAS failure occurs.
> That is...
> {code}
> // we raced and lost alloc, try again
> LockSupport.parkNanos(1);
> {code}
> h4. Constant Backoff CAS Algorithm Experiments
> Using the code attached in [^NativeAllocatorRegionTest.java], the concurrency and CAS failures of {{NativeAllocator.Region.allocate(..)}} can be demonstrated. In this class, which can be run standalone, the {{Region}} class, copied from {{NativeAllocator.Region}}, also has a {{casFailures}} field added. The data was collected from this class on a 6 CPU (12 core) MBP, running the {{NativeAllocatorRegionTest.testRegionCAS}} method.
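As a minimal sketch of what that one-line change looks like in context — a simplified standalone stand-in for {{NativeAllocator.Region}} with assumed field names, not the actual Cassandra source:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// Simplified stand-in for NativeAllocator.Region; names mirror the original
// but this is an illustrative sketch, not the Cassandra implementation.
final class Region
{
    private final long peer;       // base address of the off-heap region
    private final int capacity;    // region size in bytes
    private final AtomicInteger nextFreeOffset = new AtomicInteger(0);

    Region(long peer, int capacity)
    {
        this.peer = peer;
        this.capacity = capacity;
    }

    /** Returns the address of the allocated block, or -1 if the region is full. */
    long allocate(int size)
    {
        while (true)
        {
            int oldOffset = nextFreeOffset.get();
            if (oldOffset + size > capacity)
                return -1; // region exhausted; caller must swap in a new Region

            if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
                return peer + oldOffset;

            // we raced and lost alloc: the constant-backoff addition -- park
            // briefly before retrying instead of immediately hammering the
            // contended cache line again
            LockSupport.parkNanos(1);
        }
    }
}
```

Threads that lose the race park for a fixed interval, which spreads the retries out in time and drops the CAS failure rate without changing the one-region-at-a-time allocation strategy.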
> This attached screenshot shows the number of CAS failures during the life of a Region (over ~215 million allocations), using different thread counts and park times. It illustrates the improvement (reduction) in CAS failures from zero park time, through orders of magnitude, up to 10000000ns (10ms). The biggest improvement is from no algorithm to a park time of 1ns, where CAS failures are ~two orders of magnitude lower. From a park time of 10μs and higher there is also a significant drop at low contention rates.
> !Screen Shot 2020-07-05 at 13.16.10.png|width=500px!
> This attached screenshot shows the time it takes to fill a Region (~215 million allocations), using different thread counts and park times. The biggest improvement is from no algorithm to a park time of 1ns, where performance is one order of magnitude faster. From a park time of 100μs and higher there is an even further significant drop, especially at low contention rates.
> !Screen Shot 2020-07-05 at 13.26.17.png|width=500px!
> Repeating the test run shows reliably similar results: [^Screen Shot 2020-07-05 at 13.37.01.png] and [^Screen Shot 2020-07-05 at 13.35.55.png].
> h4. Region Per Thread Experiments
> Implementing a Region per thread (see the {{NativeAllocatorRegionTest.testRegionThreadLocal}} method), we can expect zero CAS failures over the life of a Region. For performance, we see times to fill up the Region that are two orders of magnitude lower (~420ms).
> !Screen Shot 2020-07-05 at 13.48.16.png|width=200px!
> h4. Costs
> Region per thread is an unrealistic solution, as it introduces many new issues and problems, from increased memory use to leaked memory and GC issues. It is better tackled as part of a TPC implementation.
> The backoff approach is simple and elegant, and seems to improve throughput in all situations. It does introduce context switches, which may impact throughput in some busy scenarios, so this should be tested further.
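The kind of measurement described above can be approximated with a rough standalone sketch along the lines of the attached tests. The thread count, allocation size, and park time below are illustrative values, not those used in {{NativeAllocatorRegionTest.java}}:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

// Counts CAS failures while several threads fill one shared offset counter,
// mimicking the contention pattern of NativeAllocator.Region.allocate(..).
// All names and parameters here are illustrative, not from the attached tests.
public class CasBackoffDemo
{
    static final AtomicInteger offset = new AtomicInteger(0);
    static final AtomicLong casFailures = new AtomicLong(0);
    static final int CAPACITY = 1 << 20; // a 1MB region, allocated 8 bytes at a time
    static final long PARK_NANOS = 1;    // 0 disables the backoff

    static void allocateAll()
    {
        while (true)
        {
            int old = offset.get();
            if (old + 8 > CAPACITY)
                return; // region full

            if (offset.compareAndSet(old, old + 8))
                continue; // allocation succeeded

            casFailures.incrementAndGet(); // we raced and lost alloc
            if (PARK_NANOS > 0)
                LockSupport.parkNanos(PARK_NANOS);
        }
    }

    public static void main(String[] args)
    {
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++)
            (threads[i] = new Thread(CasBackoffDemo::allocateAll)).start();
        for (Thread t : threads)
        {
            try { t.join(); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        System.out.println("allocations=" + offset.get() / 8
                           + " casFailures=" + casFailures.get());
    }
}
```

Running it with {{PARK_NANOS}} set to 0 versus 1 gives a feel for how much the single {{parkNanos}} call reduces the {{casFailures}} count under contention; the absolute numbers will of course vary by machine and thread count.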
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org