[ https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637652#comment-16637652 ]
Joseph Lynch edited comment on CASSANDRA-14747 at 10/4/18 12:31 AM:
--------------------------------------------------------------------

[~jasobrown] Ok, I think we found the problem! In the new Netty code we explicitly set the {{SO_SNDBUF}} [of the outbound socket|https://github.com/apache/cassandra/blob/47a10649dadbdea6960836a7c0fe6d271a476204/src/java/org/apache/cassandra/net/async/NettyFactory.java#L332] to 64KB. This works great over a low latency connection, but over long fat networks it is a serious issue because the high [bandwidth-delay product|https://en.wikipedia.org/wiki/Bandwidth-delay_product] restricts your bandwidth significantly. In the tests we've been running, we are trying to push a semi-reasonable amount of traffic (like 8mbps) to peers that are about 80ms away (us-east-1 to eu-west-1 is usually about [80ms|https://www.cloudping.co/]). With a 64KB window size we just don't have enough bandwidth, even though the actual link is very high bandwidth.

As we can see using {{iperf}}, setting a 64KB buffer cripples throughput:
{noformat}
# On the eu-west-1 node X
$ iperf -s -p 8080
------------------------------------------------------------
Server listening on TCP port 8080
TCP window size: 12.0 MByte (default)
------------------------------------------------------------
[  4] local X port 8080 connected with Y port 26964
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.5 sec   506 MBytes   404 Mbits/sec
[  5] local X port 8080 connected with Y port 27050
[  5]  0.0-10.5 sec  8.50 MBytes  6.81 Mbits/sec

# On the us-east-1 node Y about 80ms away
$ iperf -N -c X -p 8080
------------------------------------------------------------
Client connecting to X, TCP port 8080
TCP window size: 12.0 MByte (default)
------------------------------------------------------------
[  3] local Y port 26964 connected with X port 8080
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec   506 MBytes   421 Mbits/sec

$ iperf -N -w 64K -c X -p 8080
------------------------------------------------------------
Client connecting to X, TCP port 8080
TCP window size:  128 KByte (WARNING: requested 64.0 KByte)
------------------------------------------------------------
[  3] local Y port 27050 connected with X port 8080
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.1 sec  8.50 MBytes  7.03 Mbits/sec
{noformat}
So instead of Cassandra getting the full link's bandwidth of 500mbps, we're only able to get 7mbps. This is lower than the 8mbps we need to push, so the us-east-1 -> eu-west-1 queues effectively grow without bound until we start dropping messages. I applied a [patch|https://gist.github.com/jolynch/966e0e52f34eff7a7b8ac8d5a9cb4b5d#file-fix-the-problem-diff] which does not set {{SO_SNDBUF}} unless explicitly asked to, and *everything is completely wonderful* now. Some ways that things are wonderful:

1. The cpu usage is now on par with 3.0.x, and most of that CPU time is spent in compaction (both in garbage creation and actual cpu time):
{noformat}
2018-10-03T23:56:40.889+0000 Process summary
  process cpu=321.33%
  application cpu=301.46% (user=185.93% sys=115.52%)
  other: cpu=19.88%
  thread count: 274
  GC time=5.27% (young=5.27%, old=0.00%)
  heap allocation rate 478mb/s
  safe point rate: 0.4 (events/s) avg. safe point pause: 135.64ms
  safe point sync time: 0.08% processing time: 5.38% (wallclock time)
[000135] user=49.03% sys=11.84% alloc=  142mb/s - CompactionExecutor:1
[000136] user=44.60% sys=13.81% alloc=  133mb/s - CompactionExecutor:2
[000198] user= 0.00% sys=41.46% alloc=  4833b/s - NonPeriodicTasks:1
[000010] user= 9.56% sys= 0.67% alloc=   57mb/s - spectator-gauge-polling-0
[000029] user= 7.45% sys= 2.13% alloc= 5772kb/s - PerDiskMemtableFlushWriter_0:1
[000036] user= 0.00% sys= 8.98% alloc=  2598b/s - PERIODIC-COMMIT-LOG-SYNCER
[000115] user= 5.74% sys= 2.22% alloc=   12mb/s - MessagingService-NettyInbound-Thread-3-1
[000118] user= 4.03% sys= 3.75% alloc= 2915kb/s - MessagingService-NettyOutbound-Thread-4-3
[000117] user= 3.12% sys= 2.79% alloc= 2110kb/s - MessagingService-NettyOutbound-Thread-4-2
[000144] user= 4.03% sys= 0.92% alloc= 7205kb/s - MutationStage-1
[000146] user= 4.13% sys= 0.77% alloc= 6837kb/s - Native-Transport-Requests-2
[000147] user= 3.12% sys= 1.49% alloc= 6054kb/s - MutationStage-3
[000150] user= 3.22% sys= 1.21% alloc= 6630kb/s - MutationStage-4
[000116] user= 2.72% sys= 1.61% alloc= 1412kb/s - MessagingService-NettyOutbound-Thread-4-1
[000132] user= 2.21% sys= 2.04% alloc=   11mb/s - MessagingService-NettyInbound-Thread-3-2
[000151] user= 2.92% sys= 1.30% alloc= 5462kb/s - Native-Transport-Requests-5
[000134] user= 2.11% sys= 1.71% alloc= 6212kb/s - MessagingService-NettyInbound-Thread-3-4
[000152] user= 3.02% sys= 0.65% alloc= 5357kb/s - MutationStage-6
[000133] user= 1.81% sys= 1.83% alloc= 6236kb/s - MessagingService-NettyInbound-Thread-3-3
[000153] user= 2.32% sys= 1.31% alloc= 4538kb/s - MutationStage-7
{noformat}
The flamegraph is also nice and clean! [^4.0.15-after-sndbuf-fix.svg]

2. The latencies for LOCAL_QUORUM are significantly better than 3.0.17. The average latency has improved from 2ms to 1.2ms (40% better!!) and the 99th percentile has reduced from close to 70ms to 25ms (60% better!!).
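As an aside, for anyone wanting to double-check the arithmetic on the root cause: the throughput ceiling a fixed send buffer imposes falls straight out of the bandwidth-delay product. A quick back-of-the-envelope sketch (plain Python, using only the RTT, window, and link numbers from the iperf runs above):

```python
# Bandwidth-delay product sanity check. The 80ms RTT, 64KB window, and
# ~500mbps link figures come from the iperf measurements above; nothing
# here is Cassandra-specific.

def max_throughput_mbps(window_bytes: float, rtt_seconds: float) -> float:
    """Upper bound on single-stream TCP throughput for a given window and RTT."""
    return window_bytes * 8 / rtt_seconds / 1e6

def required_window_bytes(bandwidth_mbps: float, rtt_seconds: float) -> float:
    """Window needed to keep a link of the given bandwidth full at the given RTT."""
    return bandwidth_mbps * 1e6 / 8 * rtt_seconds

# A 64KB window over an 80ms link tops out around 6.5 mbps, right in
# line with the ~7 mbps iperf measured.
print(max_throughput_mbps(64 * 1024, 0.080))   # 6.5536

# Conversely, filling the full ~500mbps link at 80ms needs roughly a
# 5MB window, which is why the 12MB OS default performed fine.
print(required_window_bytes(500, 0.080))       # 5000000.0
```

This is why letting the kernel auto-tune the buffers (rather than pinning them at 64KB) is the right default for cross-region links.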
I've attached a graph showing the significant latency improvements: !trunk_vs_3.0.17_latency_under_load.png!

3. Cassandra trunk is no longer dropping any messages!


> Evaluate 200 node, compression=none, encryption=none, coalescing=off
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-14747
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14747
>             Project: Cassandra
>          Issue Type: Sub-task
>            Reporter: Joseph Lynch
>            Assignee: Joseph Lynch
>            Priority: Major
>         Attachments: 3.0.17-QPS.png, 4.0.1-QPS.png,
> 4.0.11-after-jolynch-tweaks.svg, 4.0.12-after-unconditional-flush.svg,
> 4.0.15-after-sndbuf-fix.svg, 4.0.7-before-my-changes.svg,
> 4.0_errors_showing_heap_pressure.txt,
> 4.0_heap_histogram_showing_many_MessageOuts.txt,
> i-0ed2acd2dfacab7c1-after-looping-fixes.svg,
> trunk_vs_3.0.17_latency_under_load.png,
> ttop_NettyOutbound-Thread_spinning.txt,
> useast1c-i-0e1ddfe8b2f769060-mutation-flame.svg,
> useast1e-i-08635fa1631601538_flamegraph_96node.svg,
> useast1e-i-08635fa1631601538_ttop_netty_outbound_threads_96nodes,
> useast1e-i-08635fa1631601538_uninlinedcpuflamegraph.0_96node_60sec_profile.svg
>
>
> Tracks evaluating a 200 node cluster with all internode settings off (no
> compression, no encryption, no coalescing).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org