[jira] [Comment Edited] (CASSANDRA-15175) Evaluate 200 node, compression=on, encryption=all
[ https://issues.apache.org/jira/browse/CASSANDRA-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871656#comment-16871656 ]

Joseph Lynch edited comment on CASSANDRA-15175 at 6/24/19 11:22 PM:

I have completed the {{LOCAL_ONE}} scaling test and summarized it in the following graph: !trunk_vs_30x_summary.png!

As we can see, even with the extra TLS CPU requirements, trunk significantly outperformed the status quo 3.0.x cluster across the load spectrum at this consistency level. I am proceeding with the other consistency levels and gathering additional data.

So far I have noticed the following issues during these tests, which I will gather more data on and follow up with in other tickets (and edit here with ticket numbers once I have them):
# JDK Netty TLS appears significantly more CPU intensive than the previous Java Sockets implementation. [~norman] is taking a look from the Netty side, and we can follow up to make sure we're not creating buffers improperly (looking at the flamegraphs it appears we may have a buffer sizing issue).
# When a node was terminated and replaced, the new node appeared to sit for a very long time waiting for schema pulls to complete (I think it was waiting on the node it was replacing, but I haven't fully debugged this).
# Nodetool netstats doesn't report progress properly for the file count (percent, single file, and size still seem right); this is probably CASSANDRA-14192.
# When we re-load NTS keyspaces from disk we throw warnings about "Ignoring Unrecognized strategy option" for datacenters that we are not in.
# After a node shuts down there is a burst of re-connections on the urgent port prior to actual shutdown (I _think_ this is pre-existing and I'm just noticing it because of the new logging).

Also, while setting up the {{LOCAL_QUORUM}} test I found the following while trying to understand why I was seeing a higher number of blocking read repairs on the trunk cluster than on the 30x cluster:
# -When I stop and start nodes, it appears that hints may not always play back. In particular, the high blocking read repairs were coming from neighbors of the node I had restarted a few times to test tcnative openssl integration. I checked the neighbors' hints directories and sure enough there were pending hints there that were not playing at all (they had been there for over 8 hours and still not played).- (Edit: This is a bad default. The default {{hinted_handoff_throttle_in_kb}} is 1024, but it is divided by the number of nodes in the cluster. In this case the cluster size of 192 meant we were playing hints at a rate of ~5 KB/s, which meant that if a node was down for even a few minutes we would essentially lose those mutations before the 24 hour hint expiry window.)
# -Repair appears to fail on the default system_traces when run with {{-full}} and {{-os}}- (Edit: this is operator error; we shouldn't pass -local to a SimpleStrategy keyspace.)
{noformat}
cass-perf-trunk-14746--useast1c-i-00a32889835534b75:~$ nodetool repair -os -full -local
[2019-06-23 23:29:30,210] Starting repair command #1 (bfbc7ba0-960e-11e9-b238-77fd1c2e9b1c), repairing keyspace perftest with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [us-east-1], hosts: [], previewKind: NONE, # of ranges: 6, pull repair: false, force repair: false, optimise streams: true)
[2019-06-23 23:52:08,248] Repair session c0573500-960e-11e9-b238-77fd1c2e9b1c for range [(384307168575030403,384307170010857891], (192153585909716729,384307168575030403]] finished (progress: 10%)
[2019-06-23 23:52:26,393] Repair session c0307320-960e-11e9-b238-77fd1c2e9b1c for range [(1808575567,192153584473889241], (192153584473889241,192153585909716729]] finished (progress: 20%)
[2019-06-23 23:52:28,298] Repair session c059f420-960e-11e9-b238-77fd1c2e9b1c for range [(576460752676171565,576460754111999053], (384307170010857891,576460752676171565]] finished (progress: 30%)
[2019-06-23 23:52:28,302] Repair completed successfully
[2019-06-23 23:52:28,310] Repair command #1 finished in 22 minutes 58 seconds
[2019-06-23 23:52:28,331] Replication factor is 1. No repair is needed for keyspace 'system_auth'
[2019-06-23 23:52:28,350] Starting repair command #2 (f52c1c70-9611-11e9-b238-77fd1c2e9b1c), repairing keyspace system_traces with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [us-east-1], hosts: [], previewKind: NONE, # of ranges: 2, pull repair: false, force repair: false, optimise streams: true)
[2019-06-23 23:52:28,351] Repair command #2 failed with error Endpoints can not be empty
[2019-06-23 23:52:28,351] Repair command #2 finished with error
error: Repair job has failed with the error message: [2019-06-23 23:52:28,351] Repair command #2 failed with error Endpoints can not be empty. Check the logs on the repair participants for further details
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2019-06-23 23:52:28,351] Repair command #2 failed with error Endpoints can not be empty. Check the logs on the repair participants for further details
{noformat}
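The hint-throttle arithmetic described in the edit above can be sketched as follows. This is a minimal illustration of the behavior the comment describes, not Cassandra code; the helper name is made up for the example:

```python
# Sketch of the effective hint playback rate described above: the default
# hinted_handoff_throttle_in_kb (1024) is divided by the number of nodes
# in the cluster, so large clusters play hints back extremely slowly.

def effective_hint_rate_kb_per_s(throttle_in_kb: float, cluster_size: int) -> float:
    """Approximate per-node hint playback rate in KB/s, per the comment."""
    return throttle_in_kb / cluster_size

if __name__ == "__main__":
    # The 192-node test cluster with the default throttle:
    rate = effective_hint_rate_kb_per_s(1024, 192)
    print(f"~{rate:.1f} KB/s")
```

At ~5 KB/s, draining even a few minutes of hinted mutations for a busy node can take longer than the 24 hour hint expiry window, which matches the observed "hints sitting for over 8 hours" behavior.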
[jira] [Comment Edited] (CASSANDRA-15175) Evaluate 200 node, compression=on, encryption=all
[ https://issues.apache.org/jira/browse/CASSANDRA-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871778#comment-16871778 ]

Norman Maurer edited comment on CASSANDRA-15175 at 6/24/19 9:10 PM:

[~jolynch] Yes, please use a non-GCM cipher and report back :) And please ensure you use the same ciphers when comparing 3.x vs trunk, as otherwise there is really no way to compare these at all (from my understanding you may be using different ciphers).

was (Author: norman):
[~jolynch] Yes please use a non GCM cipher and report back :)

> Evaluate 200 node, compression=on, encryption=all
> -
>
> Key: CASSANDRA-15175
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15175
> Project: Cassandra
> Issue Type: Sub-task
> Components: Test/benchmark
> Reporter: Joseph Lynch
> Assignee: Joseph Lynch
> Priority: Normal
> Attachments: 30x_14400cRPS-14400cWPS.svg, ShortbufferExceptions.png, odd_netty_jdk_tls_cpu_usage.png, trunk_14400cRPS-14400cWPS.svg, trunk_187000cRPS-14400cWPS.svg, trunk_187kcRPS_14kcWPS.png, trunk_22000cRPS-14400cWPS-jdk.svg, trunk_22000cRPS-14400cWPS-openssl.svg, trunk_220kcRPS_14kcWPS.png, trunk_252kcRPS-14kcWPS.png, trunk_93500cRPS-14400cWPS.svg, trunk_LQ_14400cRPS-14400cWPS.svg, trunk_vs_30x_125kcRPS_14kcWPS.png, trunk_vs_30x_14kRPS_14kcWPS_load.png, trunk_vs_30x_14kcRPS_14kcWPS.png, trunk_vs_30x_14kcRPS_14kcWPS_schedstat_delays.png, trunk_vs_30x_156kcRPS_14kcWPS.png, trunk_vs_30x_24kcRPS_14kcWPS.png, trunk_vs_30x_24kcRPS_14kcWPS_load.png, trunk_vs_30x_31kcRPS_14kcWPS.png, trunk_vs_30x_62kcRPS_14kcWPS.png, trunk_vs_30x_93kcRPS_14kcWPS.png, trunk_vs_30x_LQ_14kcRPS_14kcWPS.png, trunk_vs_30x_summary.png
>
> Tracks evaluating a 192 node cluster with compression and encryption on.
> Test setup at (reproduced below) [https://docs.google.com/spreadsheets/d/1Vq_wC2q-rcG7UWim-t2leZZ4GgcuAjSREMFbG0QGy20/edit#gid=1336583053]
>
> |Test Setup| |
> |Baseline|3.0.19 @d7d00036|
> |Candidate|trunk @abb0e177|
> | | |
> |Workload| |
> |Write size|4kb random|
> |Read size|4kb random|
> |Per Node Data|110GiB|
> |Generator|ndbench|
> |Key Distribution|Uniform|
> |SSTable Compr|Off|
> |Internode TLS|On (jdk)|
> |Internode Compr|On|
> |Compaction|LCS (320 MiB)|
> |Repair|Off|
> | | |
> |Hardware| |
> |Instance Type|i3.xlarge|
> |Deployment|96 us-east-1, 96 eu-west-1|
> |Region node count|96|
> | | |
> |OS Settings| |
> |IO scheduler|kyber|
> |Net qdisc|tc-fq|
> |readahead|32kb|
> |Java Version|OpenJDK 1.8.0_202 (Zulu)|

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
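One way to act on the suggestion above is to pin an identical non-GCM suite on both the baseline and candidate clusters in {{cassandra.yaml}}. This is a hedged sketch: {{server_encryption_options}} and {{cipher_suites}} are standard cassandra.yaml keys, but the specific CBC suite shown is illustrative, not a recommendation from the thread:

```yaml
# cassandra.yaml sketch: force the same non-GCM (CBC) cipher suite on both
# the 3.0.x baseline and trunk so the TLS CPU comparison is like-for-like.
server_encryption_options:
  internode_encryption: all
  cipher_suites:
    - TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
```

Pinning one suite on both sides removes cipher negotiation as a confounding variable when comparing flamegraphs across the two clusters.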