[jira] [Comment Edited] (CASSANDRA-15175) Evaluate 200 node, compression=on, encryption=all

2019-06-24 Thread Joseph Lynch (JIRA)


[ https://issues.apache.org/jira/browse/CASSANDRA-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871656#comment-16871656 ]

Joseph Lynch edited comment on CASSANDRA-15175 at 6/24/19 11:22 PM:


I have completed the {{LOCAL_ONE}} scaling test. I have summarized the test in 
the following graph:
 !trunk_vs_30x_summary.png!

As we can see, even with the extra TLS CPU requirements, trunk was able to 
significantly outperform the status quo 3.0.x cluster across the load spectrum 
for this consistency level.

I am proceeding with other consistency levels and gathering additional data.

So far I have noticed the following issues during these tests which I will 
gather more data on and follow up with in other tickets (and edit here with 
ticket numbers once I have them):
 # JDK Netty TLS appears significantly more CPU intensive than the previous 
Java Sockets implementation. [~norman] is taking a look from the Netty side and 
we can follow up and make sure we're not creating buffers improperly (looking 
at the flamegraphs it looks like we may have a buffer sizing issue); a brief 
provider-switch sketch follows this list.
 # When a node was terminated and replaced, the new node appeared to sit for a 
very long time waiting for schema pulls to complete (I think it was waiting on 
the node it was replacing but I haven't fully debugged this).
 # Nodetool netstats doesn't report progress properly for the file count 
(percent, single file, and size still seem right; this is probably 
CASSANDRA-14192).
 # When we re-load NTS keyspaces from disk we throw warnings about "Ignoring 
Unrecognized strategy option" for datacenters that we are not in.
 # After a node shuts down there is a burst of re-connections on the urgent 
port prior to actual shutdown (I _think_ this is pre-existing and I'm just 
noticing it because of the new logging)

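For context on item 1, here is a minimal sketch of what switching Netty between 
the JDK and tcnative/openssl TLS providers looks like. This is plain Netty API 
usage for illustration (the class name and insecure trust manager are 
placeholders), not Cassandra's actual internode wiring:
{noformat}
import io.netty.handler.ssl.OpenSsl;
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;
import io.netty.handler.ssl.SslProvider;
import io.netty.handler.ssl.util.InsecureTrustManagerFactory;

public class SslProviderSketch {
    public static void main(String[] args) throws Exception {
        // True only when netty-tcnative (OpenSSL/BoringSSL bindings) is on
        // the classpath; otherwise only the JDK provider is usable.
        System.out.println("tcnative available: " + OpenSsl.isAvailable());

        // JDK TLS engine: the configuration showing the high CPU usage above.
        SslContext jdk = SslContextBuilder.forClient()
                .sslProvider(SslProvider.JDK)
                .trustManager(InsecureTrustManagerFactory.INSTANCE) // demo only
                .build();

        // OpenSSL-backed engine via tcnative: the alternative being tested.
        // Building this throws if tcnative is not available at runtime.
        SslContext openssl = SslContextBuilder.forClient()
                .sslProvider(SslProvider.OPENSSL)
                .trustManager(InsecureTrustManagerFactory.INSTANCE) // demo only
                .build();
    }
}
{noformat}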
Also, while setting up the {{LOCAL_QUORUM}} test, I found the following while 
trying to understand why I was seeing a higher number of blocking read repairs 
on the trunk cluster than on the 30x cluster:
 # -When I stop and start nodes, it appears that hints may not always play 
back. In particular the high blocking read repairs were coming from neighbors 
of the node I had restarted a few times to test tcnative openssl integration. I 
checked the neighbors' hints directories and sure enough there were pending 
hints there that were not playing at all (they had been there for over 8 hours 
and still not played).- (Edit: This is a bad default. The default 
hinted_handoff_throttle_in_kb is 1024, but it is divided by the number of nodes 
in the cluster. In this case the cluster size of 192 meant we were playing 
hints at a rate of ~5 KiB/s, which meant that if we were down for even a few 
minutes we would essentially lose those mutations before the 24 hour hint 
expiry window; see the back-of-the-envelope sketch after the repair log below.)
 # -Repair appears to fail on the default system_traces when run with 
{{-full}} and {{-os}}.- (Edit: this is operator error; we shouldn't pass 
-local to a SimpleStrategy keyspace. See the note after the log below.)
{noformat}
cass-perf-trunk-14746--useast1c-i-00a32889835534b75:~$ nodetool repair -os 
-full -local
[2019-06-23 23:29:30,210] Starting repair command #1 
(bfbc7ba0-960e-11e9-b238-77fd1c2e9b1c), repairing keyspace perftest with repair 
options (parallelism: parallel, primary range: false, incremental: false, job 
threads: 1, ColumnFamilies: [], dataCenters: [us-east-1], hosts: [], 
previewKind: NONE, # of ranges: 6, pull repair: false, force repair: false, 
optimise streams: true)
[2019-06-23 23:52:08,248] Repair session c0573500-960e-11e9-b238-77fd1c2e9b1c 
for range [(384307168575030403,384307170010857891], 
(192153585909716729,384307168575030403]] finished (progress: 10%)
[2019-06-23 23:52:26,393] Repair session c0307320-960e-11e9-b238-77fd1c2e9b1c 
for range [(1808575567,192153584473889241], 
(192153584473889241,192153585909716729]] finished (progress: 20%)
[2019-06-23 23:52:28,298] Repair session c059f420-960e-11e9-b238-77fd1c2e9b1c 
for range [(576460752676171565,576460754111999053], 
(384307170010857891,576460752676171565]] finished (progress: 30%)
[2019-06-23 23:52:28,302] Repair completed successfully
[2019-06-23 23:52:28,310] Repair command #1 finished in 22 minutes 58 seconds
[2019-06-23 23:52:28,331] Replication factor is 1. No repair is needed for 
keyspace 'system_auth'
[2019-06-23 23:52:28,350] Starting repair command #2 
(f52c1c70-9611-11e9-b238-77fd1c2e9b1c), repairing keyspace system_traces with 
repair options (parallelism: parallel, primary range: false, incremental: 
false, job threads: 1, ColumnFamilies: [], dataCenters: [us-east-1], hosts: [], 
previewKind: NONE, # of ranges: 2, pull repair: false, force repair: false, 
optimise streams: true)
[2019-06-23 23:52:28,351] Repair command #2 failed with error Endpoints can not 
be empty
[2019-06-23 23:52:28,351] Repair command #2 finished with error
error: Repair job has failed with the error message: [2019-06-23 23:52:28,351] 
Repair command #2 failed with error Endpoints can not be empty. Check the logs 
on the repair participants for further details
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: 
[2019-06-23 23:52:28,351] Repair command #2 failed with error Endpoints can not 
be empty. Check the logs on the repair participants for further details
{noformat}

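Regarding the struck-through item 2 above: {{-local}} restricts repair to 
datacenter-local endpoints, while system_traces defaults to SimpleStrategy, 
which has no datacenter awareness, so the filter presumably leaves repair with 
an empty endpoint set ("Endpoints can not be empty"). A plausible corrected 
invocation (same flag spellings as the log above) would be either of:
{noformat}
nodetool repair -full -os -local perftest  # restrict -local to an NTS keyspace
nodetool repair -full -os                  # or drop -local entirely
{noformat}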
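And for the hint-throttle edit in item 1, a back-of-the-envelope sketch of why 
dividing the 1024 KiB/s default across 192 nodes starves hint replay. The 
class is illustrative arithmetic only (the 100 MiB backlog is a hypothetical 
figure, not a measurement from this cluster):
{noformat}
// Illustrative arithmetic, not Cassandra's hint dispatch internals.
public class HintThrottleSketch {
    public static void main(String[] args) {
        double throttleKiBps = 1024; // hinted_handoff_throttle_in_kb default
        int clusterSize = 192;       // nodes in this test cluster
        // The throttle is divided by the cluster size, so each hinting node
        // replays at roughly:
        double perNodeKiBps = throttleKiBps / clusterSize; // ~5.3 KiB/s
        System.out.printf("replay rate: ~%.1f KiB/s%n", perNodeKiBps);

        // At this test's 4 KiB write size that is barely more than one
        // hinted mutation per second, so e.g. a hypothetical 100 MiB hint
        // backlog on a single coordinator would take:
        double backlogKiB = 100 * 1024;
        System.out.printf("hours to drain 100 MiB: ~%.1f%n",
                backlogKiB / perNodeKiBps / 3600); // ~5.3 hours
    }
}
{noformat}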
[jira] [Comment Edited] (CASSANDRA-15175) Evaluate 200 node, compression=on, encryption=all

2019-06-24 Thread Norman Maurer (JIRA)


[ https://issues.apache.org/jira/browse/CASSANDRA-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871778#comment-16871778 ]

Norman Maurer edited comment on CASSANDRA-15175 at 6/24/19 9:10 PM:


[~jolynch] Yes, please use a non-GCM cipher and report back :) And please 
ensure you use the same ciphers when comparing 3.x vs trunk, as otherwise there 
is really no way to compare them at all (from my understanding you may be using 
different ciphers).
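
One way to keep that comparison apples-to-apples would be to pin the same 
suite on both clusters in cassandra.yaml. A minimal sketch, assuming the stock 
{{server_encryption_options}} layout; the CBC suite shown is just one example 
of a non-GCM JSSE suite name:
{noformat}
server_encryption_options:
    internode_encryption: all
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra
    # Pin an identical suite on both the 3.0.x and trunk clusters; this CBC
    # suite is one non-GCM option for the experiment suggested above.
    cipher_suites: [TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256]
{noformat}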


was (Author: norman):
[~jolynch] Yes please use a non GCM cipher and report back :)

> Evaluate 200 node, compression=on, encryption=all
> -------------------------------------------------
>
> Key: CASSANDRA-15175
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15175
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Test/benchmark
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Normal
> Attachments: 30x_14400cRPS-14400cWPS.svg, ShortbufferExceptions.png, 
> odd_netty_jdk_tls_cpu_usage.png, trunk_14400cRPS-14400cWPS.svg, 
> trunk_187000cRPS-14400cWPS.svg, trunk_187kcRPS_14kcWPS.png, 
> trunk_22000cRPS-14400cWPS-jdk.svg, trunk_22000cRPS-14400cWPS-openssl.svg, 
> trunk_220kcRPS_14kcWPS.png, trunk_252kcRPS-14kcWPS.png, 
> trunk_93500cRPS-14400cWPS.svg, trunk_LQ_14400cRPS-14400cWPS.svg, 
> trunk_vs_30x_125kcRPS_14kcWPS.png, trunk_vs_30x_14kRPS_14kcWPS_load.png, 
> trunk_vs_30x_14kcRPS_14kcWPS.png, 
> trunk_vs_30x_14kcRPS_14kcWPS_schedstat_delays.png, 
> trunk_vs_30x_156kcRPS_14kcWPS.png, trunk_vs_30x_24kcRPS_14kcWPS.png, 
> trunk_vs_30x_24kcRPS_14kcWPS_load.png, trunk_vs_30x_31kcRPS_14kcWPS.png, 
> trunk_vs_30x_62kcRPS_14kcWPS.png, trunk_vs_30x_93kcRPS_14kcWPS.png, 
> trunk_vs_30x_LQ_14kcRPS_14kcWPS.png, trunk_vs_30x_summary.png
>
>
> Tracks evaluating a 192 node cluster with compression and encryption on.
> Test setup at the following link (reproduced below):
> [https://docs.google.com/spreadsheets/d/1Vq_wC2q-rcG7UWim-t2leZZ4GgcuAjSREMFbG0QGy20/edit#gid=1336583053]
>  
> |Test Setup| |
> |Baseline|3.0.19 @d7d00036|
> |Candidate|trunk @abb0e177|
> | | |
> |Workload| |
> |Write size|4kb random|
> |Read size|4kb random|
> |Per Node Data|110GiB|
> |Generator|ndbench|
> |Key Distribution|Uniform|
> |SSTable Compr|Off|
> |Internode TLS|On (jdk)|
> |Internode Compr|On|
> |Compaction|LCS (320 MiB)|
> |Repair|Off|
> | | |
> |Hardware| |
> |Instance Type|i3.xlarge|
> |Deployment|96 us-east-1, 96 eu-west-1|
> |Region node count|96|
> | | |
> |OS Settings| |
> |IO scheduler|kyber|
> |Net qdisc|tc-fq|
> |readahead|32kb|
> |Java Version|OpenJDK 1.8.0_202 (Zulu)|
> | | |



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org


