[ https://issues.apache.org/jira/browse/CASSANDRA-16499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301500#comment-17301500 ]
Benjamin Lerer commented on CASSANDRA-16499:
--------------------------------------------

Hi [~tomw_nz], I can try to help. I just need one or two days to catch up with the discussion and your findings, as I have other work to attend to as well. Feel free to ping me on ASF Slack in the meantime. :-)

> single-threaded write workloads can spend ~70% time waiting on SEPExecutor
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16499
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16499
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Tom Whitmore
>            Priority: Normal
>              Labels: performance
>         Attachments: AMI Linux test -- 09.txt, Cassandra Write trace 5; warmed up -- 02.txt, MaybeStartSpinning Unpark fix on 4beta4; Cassandra Stress results -- 01.txt, MaybeStartSpinning Unpark fix; Cassandra Stress results -- 02.txt, MaybeStartSpinning Unpark fix; Linux benchmarks -- 07.xlsx, SEPWorker trace 2 delay examples -- 01.txt, SEPWorker trace 2 delays.txt, SEPWorker trace 3 delays; with proposed fix.txt, Single-thread Latencies report -- 01.xlsx, Stress Write 2 sgl-thread vs 10 threads -- 01.txt, Stress Write sgl-thread 1 -- 01.txt, Stress Write trace 1.txt, proposed fix patches.zip, tracing & experimental change patches.zip
>
>
> Hi all! While benchmarking Cassandra against other databases for a particular healthcare solution, I found some surprising anomalies in single-threaded write performance.
> Analysis & tracing suggest there might be some inefficiencies in inter-thread execution in Cassandra:
> * Tracing showed an average delay of 1.52 ms between StorageProxy.performLocally() being called and the LocalMutationRunnable actually executing.
> * Total operation time averaged 2.06 ms (measured at Message.Dispatcher processRequest()). This suggests ~72% of the +total operation time+ is lost waiting for thread scheduling in SEPExecutor.
> * When I tested with multiple threads, performance with 10 threads was 27x higher. This supports a hypothesis that scheduling delays may be hindering single-threaded progress.
> * Transaction throughput for Cassandra with a single-threaded workload measured far lower than that of PostgreSQL on the same hardware. Postgres achieved ~200k committed transactions/minute including fsync; Cassandra achieved ~37k per minute. Given they are both essentially writing to a commit log, it may be informative to understand why the difference arises.
> Cassandra's architecture seems in theory like it should be well aligned with our use case, given the Commit Log and Log Structured Merge design. Some of our customers have configurations posing high single-threaded loads, so I spent some time trying to understand why efficiency for such loads seemed lower than expected.
> My investigation so far:
> * benchmarked Cassandra 3.11.10
> * stack-dumped it under load & identified a pattern of threads waiting in AbstractWriteResponseHandler while nothing else was busy
> * checked out the Cassandra 3.11.10 source, built it, and debugged & stepped through key areas to try to understand the behavior
> * instrumented key areas with custom tracing code & timestamps at 0.01 ms resolution (a sketch of this style of instrumentation follows this list)
> ** _see patch attached._
> * benchmarked Cassandra 4 beta 4 & verified the delays are also present
> * showed & traced delays with my healthcare scenario benchmark
> * showed & traced delays with the +Cassandra stress-test+ tool.
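>
> A minimal, self-contained sketch of that style of instrumentation, using only java.lang; this is an illustration, not the attached patch, and the ScheduleTrace name is hypothetical:
> {code:java}
> // Illustrative tracing shim (not the attached patch): wraps a task so the
> // gap between scheduling and execution is reported to 0.01 ms resolution.
> final class ScheduleTrace
> {
>     static Runnable timed(String label, Runnable task)
>     {
>         final long enqueuedNanos = System.nanoTime();  // e.g. where StorageProxy.performLocally() hands off
>         return () -> {
>             long startNanos = System.nanoTime();       // when the worker finally runs the task
>             System.out.printf("%s schedule delay: %.2f ms%n",
>                               label, (startNanos - enqueuedNanos) / 1_000_000.0);
>             task.run();
>         };
>     }
> }
> {code}
> Wrapping the mutation Runnable this way before submitting it to the stage would surface the ~1.5 ms scheduling gaps reported above.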
> The configuration was:
> * single-node Cassandra running locally, on a recent Dell laptop with an NVMe SSD.
> * for the healthcare scenario:
> ** Java client app running 1 or 10 threads;
> ** trialled LOCAL_ONE and ANY consistency levels;
> ** trialled unbatched, BatchType.UNLOGGED and BatchType.LOGGED.
> * for 'cassandra-stress':
> ** cassandra-stress.bat write n=10000 -rate threads=1
> Without deeply understanding the code, I have considered a couple of possible areas/ideas for improvement. I worked on the 3.11.10 codebase. I'd be interested to understand whether these are sound; note that neither achieves as much improvement as might theoretically be hoped for.
> My investigation is based on the key observation of large delays between StorageProxy.performLocally() being invoked and the LocalMutationRunnable actually executing, for single-threaded workloads.
> What I looked at:
> * Without fully understanding SEPExecutor.takeWorkPermit() – it answers true to execute immediately, false if the task is scheduled – scheduling seemed slow for this workload.
> ** takeWorkPermit() answers false if no work permits are available.
> ** I noticed takeWorkPermit() also answers false if no task permits are available, +even if no task permit need be taken.+
> ** by changing this condition I was able to gain +45% performance (see the first sketch after this description).
> * Without deeply understanding the SEP Executor/Worker or its sleep algorithms, I noted that in a single-thread workload SEPWorkers would likely spin & be put to sleep for a period after completing each task.
> ** I did wonder if the park -times- or parking behavior (empirically observed at 10,000 - 20,000 nanos) could contribute to threads being more aggressively de-scheduled.
> ** an experiment in keeping 1 SEPWorker awake (not sleeping at all) gained +7.9% performance.
> ** _Note: the initial ticket misread the code as requesting 500,000 nanosecond sleeps. This has now been corrected._
> * Without deeply understanding the SEP Executor/Worker, I feel there may be more questions around how SEP Workers are brought out of the SPINNING/sleep state and whether this logic functions promptly & correctly (see the second sketch after this description).
> ** At a very initial stage of investigation: +SEPWorker.assign() unparks threads when transitioning out of STOPPED state, but the code appears potentially not to unpark threads coming out of SPINNING state.+
> ** _This is from a very cursory "looking at the code" & initial debugging stage, and I'm not certain it's accurate._ _Attempted experiments to unpark for the SPINNING -> Work transition have so far caused lockups of 100% machine CPU use or dropped messages, rather than helping anything._
> ** _If & when I can find out more, I'll post it here._
> I will post the tracing code & traces I captured, and welcome feedback and thoughts on these performance questions from the Cassandra dev community.
> Thanks all!
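>
> A minimal, self-contained sketch of the takeWorkPermit() condition discussed above, assuming task and work permits are packed into one long and claimed with CAS; this is a toy reconstruction of the reported change, not the actual SEPExecutor source, and the PermitSketch/combine names are hypothetical:
> {code:java}
> import java.util.concurrent.atomic.AtomicLong;
>
> // Toy reconstruction, NOT the Cassandra source: task permits in the high
> // 32 bits, work permits in the low 32 bits, claimed via compareAndSet.
> final class PermitSketch
> {
>     private final AtomicLong permits = new AtomicLong(combine(0, 0));
>
>     private static long combine(int taskPermits, int workPermits)
>     {
>         return ((long) taskPermits << 32) | (workPermits & 0xFFFFFFFFL);
>     }
>     private static int taskPermits(long combined) { return (int) (combined >>> 32); }
>     private static int workPermits(long combined) { return (int) combined; }
>
>     boolean takeWorkPermit(boolean takeTaskPermit)
>     {
>         int taskDelta = takeTaskPermit ? 1 : 0;
>         while (true)
>         {
>             long current = permits.get();
>             int taskPermits = taskPermits(current);
>             int workPermits = workPermits(current);
>             // Reported behaviour: the original also answers false when
>             // taskPermits == 0, even if takeTaskPermit is false. The
>             // experimental change requires a task permit only when one
>             // is actually being taken:
>             if (workPermits == 0 || (takeTaskPermit && taskPermits == 0))
>                 return false;
>             if (permits.compareAndSet(current, combine(taskPermits - taskDelta, workPermits - 1)))
>                 return true;
>         }
>     }
> }
> {code}
> If that reading of the condition is right, a caller that needs no task permit is no longer turned away just because the task-permit count is momentarily zero.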
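>
> A toy model of the wake-up concern in the last bullet, assuming a SPINNING worker sleeps in LockSupport.parkNanos(); this is not SEPWorker itself, and the WorkerSketch/spinLoop names are hypothetical:
> {code:java}
> import java.util.concurrent.atomic.AtomicReference;
> import java.util.concurrent.locks.LockSupport;
>
> // Toy model, NOT SEPWorker: a worker parked mid-spin when work is assigned
> // is not unparked, so the task waits out the remainder of the sleep.
> final class WorkerSketch
> {
>     enum State { STOPPED, SPINNING, WORKING }
>
>     private final AtomicReference<State> state = new AtomicReference<>(State.SPINNING);
>     private volatile Thread thread;
>
>     void assign()
>     {
>         State prev = state.getAndSet(State.WORKING);
>         if (prev == State.STOPPED)
>             LockSupport.unpark(thread);  // STOPPED -> work: woken promptly
>         // prev == SPINNING: no unpark here mirrors the reported reading of
>         // the code; the ticket notes naively adding one caused 100% CPU
>         // lockups or dropped messages rather than an improvement.
>     }
>
>     void spinLoop()
>     {
>         thread = Thread.currentThread();
>         while (state.get() == State.SPINNING)
>             LockSupport.parkNanos(20_000);  // 10,000 - 20,000 ns parks observed empirically
>     }
> }
> {code}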