[ https://issues.apache.org/jira/browse/CASSANDRA-14746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205922#comment-17205922 ]
Josh McKenzie commented on CASSANDRA-14746:
-------------------------------------------

{quote}4.0 should have better latency, more throughput, fewer threads, fewer context switches, less GC allocation, and faster recovery time.
{quote}
Was this the goal of the MS rewrite? I have no horse in this race - I just thought the goal of it was to tighten up some of the things that were present / still troublesome after Jason's rewrite of things rather than specifically targeting performance improvements.

I'd personally advocate for "no regression on categories a-e" with better backpressure, tolerance for failure, etc. etc. that I understood to come along w/the MS rewrite. At least in terms of what we should consider a blocker for 4.0, I think "don't regress" is a stance that makes sense, especially as incremental performance improvements are reasonable to consider for patch releases IMO.

And fwiw, the benchmarks I've seen on 4.0 show a pretty significant improvement in throughput if nothing else, but in terms of bar - no regression for a rewrite seems like a good low water mark to block on.

> Ensure Netty Internode Messaging Refactor is Solid
> --------------------------------------------------
>
> Key: CASSANDRA-14746
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14746
> Project: Cassandra
> Issue Type: Improvement
> Components: Legacy/Streaming and Messaging
> Reporter: Joey Lynch
> Assignee: Joey Lynch
> Priority: Normal
> Labels: 4.0-QA
> Fix For: 4.0-beta, 4.0-triage
>
>
> Before we release 4.0 let's ensure that the internode messaging refactor is
> 100% solid. As internode messaging is naturally used in many code paths and
> widely configurable we have a large number of cluster configurations and test
> configurations that must be vetted.
> We plan to vary the following:
> * Version of Cassandra 3.0.17 vs 4.0-alpha
> * Cluster sizes with *multi-dc* deployments ranging from 6 - 100 nodes
> * Client request rates varying between 1k QPS and 100k QPS of varying sizes
> and shapes (BATCH, INSERT, SELECT point, SELECT range, etc ...)
> * Internode compression
> * Internode SSL (as well as openssl vs jdk)
> * Internode Coalescing options
> We are looking to measure the following as appropriate:
> * Latency distributions of reads and writes (lower is better)
> * Scaling limit, aka maximum throughput before violating p99 latency
> deadline of 10ms @ LOCAL_QUORUM, on a fixed hardware deployment for 100%
> writes, 100% reads and 50-50 writes+reads (higher is better)
> * Thread counts (lower is better)
> * Context switches (lower is better)
> * On-CPU time of tasks (higher periods without context switch is better)
> * GC allocation rates / throughput for a fixed size heap (lower allocation
> better)
> * Streaming recovery time for a single node failure, i.e. can Cassandra
> saturate the NIC
>
> The goal is that 4.0 should have better latency, more throughput, fewer
> threads, fewer context switches, less GC allocation, and faster recovery
> time. I'm putting Jason Brown as the reviewer since he implemented most of
> the internode refactor.
> Current collaborators driving this QA task: Dinesh Joshi, Jordan West, Joey
> Lynch (Netflix), Vinay Chella (Netflix)
> Owning committer(s): Jason Brown
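
For readers unfamiliar with the "scaling limit" metric in the description above (maximum throughput before violating a p99 latency deadline of 10ms), here is a minimal sketch of how it could be computed from load-test samples. This is not taken from the ticket or from any Cassandra tooling: the function names `p99` and `scaling_limit`, the nearest-rank percentile method, and the shape of the `runs` input are all assumptions made for illustration.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies (ms)."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # 1-based nearest-rank index: ceil(0.99 * n), done in integer arithmetic
    rank = (99 * n + 99) // 100
    return ordered[rank - 1]


def scaling_limit(runs, deadline_ms=10.0):
    """Highest offered QPS reached before the first run that misses the deadline.

    `runs` is a hypothetical mapping of offered QPS -> observed latencies (ms),
    one entry per load-test step on the same fixed hardware.
    Returns None if even the lowest rate violates the deadline.
    """
    limit = None
    for qps in sorted(runs):
        if p99(runs[qps]) > deadline_ms:
            break  # deadline violated: stop at the last passing rate
        limit = qps
    return limit
```

Under the "don't regress" bar discussed in the comment, one might run the same steps against 3.0.17 and 4.0-alpha and require that the 4.0 value from `scaling_limit` is not lower than the 3.0 value. That comparison is likewise only an illustration of the idea.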