[ 
https://issues.apache.org/jira/browse/CASSANDRA-16499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301399#comment-17301399
 ] 

Tom Whitmore commented on CASSANDRA-16499:
------------------------------------------

Hi [~benedict], [~mck], [~clohfink], all,

I'm pleased to have been able to contribute this time to investigating 
performance limitations & identifying this thread scheduling issue.

A few questions with regard to Benedict's comment of 11/March:
 # "We need to either prove the [earlier 
results|http://riptano.github.io/cassandra_performance/graph_v3/graph.html?stats=stats.4718.ec2.may12.threads-270-cql3_native_prepared.json&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=169.18&ymin=0&ymax=200556.4]
 were faulty" – the linked results appear to be for 270 threads. Sorry, I'm not 
understanding: how do these conflict with or invalidate this issue, which has 
most effect at low thread counts, e.g. 1 to 50 threads? Or are we looking at 
the wrong results?
 # "We also need to understand what has changed" – as above.
 # You mention wanting to understand what is happening. Is "absence of code in 
the SEP to wake up spinning threads when work is enqueued" a sufficient 
understanding of what the defect is and why it impacts performance? Lacking any 
information to the contrary, I would assume this is simply a minor oversight 
causing a latent performance defect; one which may understandably have been 
overlooked due to the focus on concurrent workloads, but which should now be 
fixed.
 # "We should be performing comparisons with plain TPE, SEP without any 
spinning, this and plain SEP" – a complete re-evaluation of thread pools seems 
a rather larger scope than fixing the single identified performance issue in 
the existing SEP. I would also ask: what evidence do we have to motivate a 
larger effort? I don't personally have any numbers or insight to suggest that 
this larger scope of work would yield better results than simply fixing the 
SEP.
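To make point 3 concrete in code, here is a minimal sketch of the pattern I 
believe is at fault. All names here are illustrative only, not Cassandra's 
actual SEPExecutor classes: a worker drains its queue, then parks briefly in a 
SPINNING state; if the producer enqueues work without unparking it, the task 
waits out the remainder of the park interval.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

// Illustrative model only; simplified, not Cassandra's real SEPExecutor/SEPWorker.
public class SpinWakeupSketch {
    enum State { WORKING, SPINNING }

    final ConcurrentLinkedQueue<Runnable> tasks = new ConcurrentLinkedQueue<>();
    final AtomicReference<State> state = new AtomicReference<>(State.WORKING);
    volatile Thread worker;

    // Worker loop: drain the queue, then park briefly in SPINNING state.
    void runWorker() {
        worker = Thread.currentThread();
        while (!Thread.currentThread().isInterrupted()) {
            Runnable t;
            while ((t = tasks.poll()) != null)
                t.run();
            if (state.compareAndSet(State.WORKING, State.SPINNING)) {
                LockSupport.parkNanos(20_000);   // ~10,000-20,000 ns parks, as observed empirically
                state.set(State.WORKING);
            }
        }
    }

    // Producer: enqueue, then wake any spinning worker. The suspected defect
    // is the absence of the unpark() below: without it, a single-threaded
    // producer's task is only noticed after the park interval expires.
    void submit(Runnable task) {
        tasks.add(task);
        if (state.compareAndSet(State.SPINNING, State.WORKING))
            LockSupport.unpark(worker);   // the missing wakeup
    }
}
```

In a single-threaded workload this wakeup path is hit on nearly every 
operation, which is consistent with the per-operation delays traced above.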

As an architect I absolutely appreciate the need for diligence & the avoidance 
of unexpected regressions for Cassandra users. However, from my perspective 
there is likely a difference between A) the testing practically required to 
show resolution of an identifiable performance fault & reasonably ensure no 
regression, and B) a comprehensive architectural re-evaluation of all 
thread-pool options.

Given that a point performance issue has been identified in SEP, I believe it 
would not be unreasonable to test & implement a fix for just that scope, with 
suitable regression testing & diligence. 
 * Making the choice of thread-pool executor configurable is probably OK; it's 
marginally lower-risk. 
 * However my suggestion would be that, if possible, just making the existing 
SEP work properly for all cases would be a simpler and better option, and would 
avoid accumulating complexity debt in the product.
 * Allowing a configurable switch to DebuggableThreadPoolExecutor relies on 
implicit assumptions that the DTPE performs reliably and relatively comparably 
for most workloads. I suppose since it's based on the standard Java TPE those 
are probably _reasonable_ assumptions, but I don't personally have any evidence 
that a switch wouldn't lower performance. 

I am pleased to have been able to recognize the problem and contribute 2 weeks 
of work to tracing it, identifying the root cause, producing a fix and 
performing a moderate initial level of testing. I think this should be a fair 
proportion of what is needed to get option A) across the line.

I would tend to see option B), the wider change of thread pooling, as likely to 
need rather deeper & more thorough testing than option A). I would assume that 
such a change should be tested across a wider range of workloads, cluster 
structures and platforms. To me, this sounds like a rather larger project; and 
I would ask beforehand what evidence we actually have to motivate such an 
effort.

Currently, I might be able to contribute a very small amount of additional 
effort to advance option A) or similar. Given my company's priorities, however 
– especially the fact that we are only at an assessment stage with Cassandra – 
it's not clear that we have time to invest in a larger scoped effort.

So some specific questions:
 * What regression testing would be considered necessary for option A) – fixing 
the current SEP – to be acceptably de-risked? 
 * Would we consider a fixed SEP to be reasonable on its own, or is a 
configurable choice of executors a hard requirement? How much de-risking would 
be needed to avoid requiring stage executors to be configurable?
 * If we want to offer choice, one potential approach: we could push the 
current behavior up as an ancestor from SharedExecutorPool (perhaps renaming it 
to LegacySharedExecutorPool), and have SharedExecutorPool as a subclass 
implementing the fix. It could then be configurable which executor to use 
(defaulting to the fixed version). Would that be an acceptable approach?
 * Who on the DataStax/ Cassandra project could potentially help take this 
forward?
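To illustrate the configurable-executor suggestion above, a rough sketch of 
the shape I have in mind. All names here are hypothetical (LegacySharedExecutorPool 
does not exist today), and the bodies are stubs standing in for the real 
behavior:

```java
// Hypothetical sketch of the suggested hierarchy; names and bodies are illustrative only.
public class ExecutorPoolChoice {

    // Current behavior, pushed up as the ancestor class.
    static class LegacySharedExecutorPool {
        // Existing SEP behavior: no wakeup of spinning workers on enqueue.
        void onTaskEnqueued() { }
        String describe() { return "legacy: no spinning-worker wakeup"; }
    }

    // Fixed behavior as a subclass, keeping the existing name.
    static class SharedExecutorPool extends LegacySharedExecutorPool {
        @Override
        void onTaskEnqueued() { /* unpark a SPINNING worker here */ }
        @Override
        String describe() { return "fixed: unparks spinning workers"; }
    }

    // Configurable selection, defaulting to the fixed version.
    static LegacySharedExecutorPool create(String configured) {
        return "legacy".equalsIgnoreCase(configured)
             ? new LegacySharedExecutorPool()
             : new SharedExecutorPool();
    }
}
```

The point of this shape is that the fix is the default, while the current 
behavior remains one configuration switch away if a regression surfaces.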

Regards,
Tom

> single-threaded write workloads can spend ~70% time waiting on SEPExecutor
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16499
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16499
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Tom Whitmore
>            Priority: Normal
>              Labels: performance
>         Attachments: AMI Linux test -- 09.txt, Cassandra Write trace 5;  
> warmed up -- 02.txt, MaybeStartSpinning Unpark fix on 4beta4; Cassandra 
> Stress results -- 01.txt, MaybeStartSpinning Unpark fix; Cassandra Stress 
> results -- 02.txt, MaybeStartSpinning Unpark fix; Linux benchmarks -- 
> 07.xlsx, SEPWorker trace 2 delay examples -- 01.txt, SEPWorker trace 2 
> delays.txt, SEPWorker trace 3 delays;  with proposed fix.txt, Single-thread 
> Latencies report -- 01.xlsx, Stress Write 2 sgl-thread vs 10 threads -- 
> 01.txt, Stress Write sgl-thread 1 -- 01.txt, Stress Write trace 1.txt, 
> proposed fix patches.zip, tracing & experimental change patches.zip
>
>
> Hi all! While conducting benchmarking of Cassandra against other databases 
> for a particular healthcare solution, I found some surprising anomalies in 
> single-threaded write performance. 
> Analysis & tracing suggest there might be some inefficiencies in inter-thread 
> execution in Cassandra:
>  * Tracing showed an average delay of 1.52 ms between 
> StorageProxy.performLocally() being called, and the LocalMutationRunnable 
> actually executing.
>  * Total operation time averaged 2.06 ms (measured at Message.Dispatcher 
> processRequest()). This suggested ~72% of the +total operation time+ being 
> lost waiting for thread scheduling in SEPExecutor.
>  * When I tested with multiple threads, performance with 10 threads was 27x 
> higher. This supports a hypothesis that scheduling delays may be hindering 
> single-threaded progress.
>  * Transaction throughput for Cassandra with a single-threaded workload 
> measured far lower than that of PostgreSQL on the same hardware: Postgres 
> achieved ~200k committed transactions/ minute including fsync; Cassandra 
> achieved ~37k per minute. Given they are both essentially writing to a commit 
> log, it may be informative to understand why the differences arise.
> Cassandra's architecture seems in theory like it might be aligned with our 
> use case, given the Commit Log and Log Structured Merge design. Some of our 
> customers have configurations posing high single-threaded loads. Therefore I 
> spent some time trying to understand why efficiency for such loads seemed 
> less than expected.
> My investigation so far:
>  * benchmarked Cassandra 3.11.10
>  * stack-dumped it under load & identified a pattern of threads waiting in 
> AbstractWriteResponseHandler while nothing else is busy
>  * checked out Cassandra 3.11.10 source, built it, debugged & stepped 
> through key areas to try and understand behavior.
>  * instrumented key areas with custom tracing code & timestamps to 0.01 
> millisecond.
>  ** _see patch attached._
>  * benchmarked Cassandra 4 beta 4 & verified delays also present
>  * shown & traced delays with my healthcare scenario benchmark
>  * shown & traced delays with the +Cassandra stress-test+ tool.
> The configuration was:
>  * single-node Cassandra running locally, on a recent Dell laptop with NVmE 
> SSD.
>  * for the healthcare scenario:
>  ** Java client app running 1 or 10 threads;
>  ** trialled LOCAL_ONE and ANY consistency levels;
>  ** trialled unbatched, BatchType.UNLOGGED and BatchType.LOGGED.
>  * for 'cassandra-stress':
>  ** cassandra-stress.bat write n=10000 -rate threads=1
> Without deeply understanding the code, I have considered a couple of possible 
> areas/ ideas for improvement. I worked on the 3.11.10 codebase. I'd be 
> interested to understand whether these might be sound; note that neither 
> achieves as much improvement as might theoretically be hoped for.
> My investigations are based on the key observation of large delays between 
> StorageProxy.performLocally() being invoked and the LocalMutationRunnable 
> actually executing, for single-threaded workloads.
> What I looked at:
>  * Without fully understanding SEPExecutor.takeWorkPermit() (it answers true 
> to execute immediately, false if scheduled), scheduling seemed slow for this 
> workload.
>  ** takeWorkPermit() answers false if no work permits are available.
>  ** I noticed takeWorkPermit() also answers false if no task permits are 
> available, +even if no task permit need be taken.+
>  ** by changing this condition I was able to gain +45% performance.
>  * Without deeply understanding SEP Executor/ Worker or sleep algorithms, I 
> noted that in a single-thread workload SEPWorkers would likely spin & be put 
> to sleep for a period after completing each task.
>  ** I did wonder if the park -times- or parking behavior (empirically 
> observed at 10,000 - 20,000 nanos) could contribute to threads being more 
> aggressively de-scheduled.
>  ** an experiment in keeping 1 SEPWorker awake (not sleeping at all) gained 
> +7.9% performance.
>  ** _Note: initial ticket misread code as requesting 500,000 nanosecond 
> sleeps. This has now been corrected._
>  * Without deeply understanding SEP Executor/ Worker, I feel there may be 
> more questions around how SEP Workers are brought out of SPINNING/ sleep 
> state and whether this logic functions promptly & correctly.
>  ** At a very initial stage of investigation: +SEPWorker.assign() unparks 
> threads when transitioning out of STOPPED state, but code appears potentially 
> not to unpark threads coming out of SPINNING state.+
>  ** _This is a very cursory "looking at the code" & initial debugging stage, 
> but I'm not certain it's accurate._ _Attempted experiments to unpark for the 
> SPINNING -> Work transition so far_ _have_ _caused lockups of 100% machine 
> CPU use or dropped messages, rather than helping anything._
>  ** _If & when I can find out more, I'll post it here._
> I will post the tracing code & traces I captured, and welcome some feedback 
> and thoughts on these performance questions from the Cassandra dev community. 
> Thanks all!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
