[ https://issues.apache.org/jira/browse/CASSANDRA-14764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
C. Scott Andreas updated CASSANDRA-14764: ----------------------------------------- Component/s: Streaming and Messaging > Evaluate 12 Node Breaking Point, compression=none, encryption=none, > coalescing=off > ---------------------------------------------------------------------------------- > > Key: CASSANDRA-14764 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14764 > Project: Cassandra > Issue Type: Sub-task > Components: Streaming and Messaging > Reporter: Joseph Lynch > Priority: Major > Attachments: i-03341e1c52de6ea3e-after-queue-change.svg, > i-07cd92e844d66d801-after-queue-bound.svg, i-07cd92e844d66d801-hint-play.svg, > i-07cd92e844d66d801-uninlined-with-jvm-methods.svg, ttop.txt > > > *Setup:* > * Cassandra: 12 (2*6) node i3.xlarge AWS instance (4 cpu cores, 30GB ram) > running cassandra trunk off of jasobrown/14503 jdd7ec5a2 (Jasons patched > internode messaging branch) vs the same footprint running 3.0.17 > * Two datacenters with 100ms latency between them > * No compression, encryption, or coalescing turned on > *Test #1:* > ndbench sent 1.5k QPS at a coordinator level to one datacenter (RF=3*2 = 6 so > 3k global replica QPS) of 4kb single partition BATCH mutations at LOCAL_ONE. > This represents about 250 QPS per coordinator in the first datacenter or 60 > QPS per core. The goal was to observe P99 write and read latencies under > various QPS. > *Result:* > The good news is since the CASSANDRA-14503 changes, instead of keeping the > mutations on heap we put the message into hints instead and don't run out of > memory. The bad news is that the {{MessagingService-NettyOutbound-Thread's}} > would occasionally enter a degraded state where they would just spin on a > core. I've attached flame graphs showing the CPU state as [~jasobrown] > applied fixes to the {{OutboundMessagingConnection}} class. > *Follow Ups:* > [~jasobrown] has committed a number of fixes onto his > {{jasobrown/14503-collab}} branch including: > 1. Limiting the amount of time spent dequeuing messages if they are expired > (previously if messages entered the queue faster than we could dequeue them > we'd just inifinte loop on the consumer side) > 2. Don't call {{dequeueMessages}} from within {{dequeueMessages}} created > callbacks. > We're continuing to use CPU flamegraphs to figure out where we're looping and > fixing bugs as we find them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org