[ https://issues.apache.org/jira/browse/CASSANDRA-16499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17297722#comment-17297722 ]
Tom Whitmore commented on CASSANDRA-16499: ------------------------------------------ Hi [~benedict], [~clohfink] , thanks for the responses. Correction on park times in doWaitSpin() – as you say, these are about 10,000 - 20,000 nanos. My error misreading the code, and jumping to a hypothesis that it might be responsible for the large delays shown by the trace in LocalMutationRunnable execution. (I will correct the narrative of this ticket accordingly.) I am still unsure there is any corresponding 'unpark' for the 'park' in the spin case, but whether this is significant I'm no longer sure. I am still trying to understand the cause of the large delays I seem to observe in traces (1.52 ms average) and the low throughput figures achieved (274 op/sec). The key delay shown in my tracing seems to be between StorageProxy.performLocally() calling maybeExecuteImmediately(), and the LocalMutationRunnable actually running. To answer your questions: * What environment am I running Cassandra in? ** Dell laptop with Core i7 @ 2.6 GHz, 32 GB memory, Windows 10 ** AdoptOpenJDK 8 (jdk-8u282b08) for Cassandra 3.11.10 ** AdoptOpenJDK 11 (jdk-11.0.9.101-hotspot) for Cassandra 4 beta 4, with some slightly hacked-up startup scripts from Cassandra 3.11.10 * Am I implicitly sending many requests at once with Postgres? ** No, I have separate tests for 1 thread and 10 threads. The single-threaded workload loops N times (say 200k for Postgres), each loop runs a simple sequential transaction with a couple of inserts and then commits. ** I've verified I can achieve similar Postgres throughput when restricting the connection pool to one connection, and used 'netstat' to verify only one socket in use. Progress this morning: * I have debugged doWaitSpin() and empirically observe requested sleep times at 10,000 - 20,000 nanoseconds. * I trialled experimental code to 'unpark' the worker thread in SEPWorker.assign() when transitioning out of sleep, but so far I only see fractional (perhaps +1.4%) gains. Outstanding questions: * Why the ~1.5 ms delays shown in my traces, before executing LocalMutationRunnable via SEPExecutor? * Why does Cassandra Stress achieve such a low single-threaded write throughput (274 op/s) against a single local node? * Why the 27x throughput difference between my 1 and 10 thread tests? I would be really interested in any further thoughts or feedback. > single-threaded write workloads can spend ~70% time waiting on SEPExecutor > -------------------------------------------------------------------------- > > Key: CASSANDRA-16499 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16499 > Project: Cassandra > Issue Type: Bug > Reporter: Tom Whitmore > Priority: Normal > Attachments: Cassandra Write trace 5; warmed up -- 02.txt, > Single-thread Latencies report -- 01.xlsx, Stress Write sgl-thread 1 -- > 01.txt, Stress Write trace 1.txt, tracing & experimental change patches.zip > > > Hi all! While conducting benchmarking of Cassandra against other databases > for a particular healthcare solution, I found some surprising anomalies in > single-threaded write performance. > Analysis & tracing suggest there might be some inefficiencies in inter-thread > execution in Cassandra; > * Tracing showed an average delay of 1.52 ms between > StorageProxy.performLocally() being called, and the LocalMutationRunnable > actually executing. > * Total operation time averaged 2.06 ms (measured at Message.Dispatcher > processRequest()). This suggested ~72% of the +total operation time+ being > lost waiting for thread scheduling in SEPExecutor. > * With I tested with multiple threads, performance with 10 threads was 27x > higher. This supports a hypothesis that scheduling delays may be hindering > single-threaded progress. > * Transaction throughput for Cassandra with a single-threaded workload, > measured far lower than that of PostgreSQL on the same hardware. Postgres > achieved ~200k committed transactions/ minute including fsync; Cassandra > achieves ~37k per minute. Given they are both essentially writing to a commit > log, it may be informative to understand why the differences are arising. > Cassandra's architecture seems in theory like it might be aligned for our > usecase, given the Commit Log and Log Structured Merge design. Some of our > customers have configurations posing high single-threaded loads. Therefore I > spent some time trying to understand why efficiency for such loads seemed > less than expected. > My investigation so far: > * benchmarked Cassandra 3.11.10 > * stack-dumped it under load & identified a pattern of threads waiting in > AbstractWriteResponseHandler while nothing else is busy > * checked out Cassandra 3.11.10 source, built it, debugged & stepped > through key areas to try and understand behavior. > * instrumented key areas with custom tracing code & timestamps to 0.01 > millisecond. > ** _see patch attached._ > * benchmarked Cassandra 4 beta 4 & verified delays also present > * shown & traced delays with my healthcare scenario benchmark > * shown & traced delays with the +Cassandra stress-test+ tool. > The configuration was: > * single-node Cassandra running locally, on a recent Dell laptop with NVmE > SSD. > * for the healthcare scenario: > ** Java client app running 1 or 10 threads; > ** trialled LOCAL_ONE and ANY consistency levels; > ** trialled unbatched, BatchType.UNLOGGED and BatchType.LOGGED. > * for 'cassandra-stress': > ** cassandra-stress.bat write n=10000 -rate threads=1 > Without deeply understanding the code, I have considered a couple of possible > areas/ ideas as to improvement. I worked on the 3.11.10 codebase. I'd be > interested to understand whether or not these might be sound or not; note > that neither achieves as much improvement as might theoretically be hoped for. > My investigations based on the key observation of large delays between > StorageProxy.performLocally() being invoked and the LocalMutationRunnable > actually executing, for single-threaded workloads. > What I looked at: > * Without fully understanding SEPExecutor.takeWorkPermit() – it answers true > to execute immediately, false if scheduled – for this workload scheduling > seemed slow. > ** takeWorkPermit() answers false if no work permits are available. > ** I noticed takeWorkPermit() also answers false if no task permits are > available, +even if no task permit need be taken.+ > ** by changing this condition I was able to gain +45% performance. > * Without deeply understanding SEP Executor/ Worker or sleep algorithms, I > noted that in a single-thread workload SEPWorkers would likely spin & be put > to sleep for a fairly extended period after completing each task. > ** I did wonder if the relatively-long park times (average of 500,000 nanos > = 0.5ms) could contribute to threads being more aggressively de-scheduled. > ** an experiment in keeping 1 SEPWorker awake (not sleeping at all) gained > +7.9% performance. > * Without deeply understanding SEP Executor/ Worker, I feel there may be > more questions around how SEP Workers are brought out of SPINNING/ sleep > state and whether this logic functions promptly & correctly. > ** At a very initial stage of investigation: +SEPWorker.assign() unparks > threads when transitioning out of STOPPED state, but code appears potentially > not to unpark threads coming out of SPINNING state.+ > ** _This is a very cursory "looking at the code" & initial debugging stage, > but I'm not certain it's accurate._ _Attempted experiments to unpark for the > SPINNING -> Work transition so far_ _have_ _caused lockups of 100% machine > CPU use or dropped messages, rather than helping anything._ > ** _If & when I can find out more, I'll post it here._ > I will post the tracing code & traces I captured, and welcome some feedback > and thoughts on these performance questions from the Cassandra dev community. > Thanks all! -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org