[jira] [Commented] (CASSANDRA-18808) netty-handler vulnerability: CVE-2023-4586
[ https://issues.apache.org/jira/browse/CASSANDRA-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770135#comment-17770135 ]

Norman Maurer commented on CASSANDRA-18808:
---

Sorry, I didn't have time yet, but it's on my todo list.

> netty-handler vulnerability: CVE-2023-4586
> ------------------------------------------
>
>          Key: CASSANDRA-18808
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-18808
>      Project: Cassandra
>   Issue Type: Bug
>   Components: Consistency/Coordination
>     Reporter: Brandon Williams
>     Assignee: Brandon Williams
>     Priority: Normal
>      Fix For: 5.0.x, 5.x
>
> This is failing OWASP:
> {noformat}
> Dependency-Check Failure:
> One or more dependencies were identified with vulnerabilities that have a
> CVSS score greater than or equal to '1.0':
> netty-handler-4.1.96.Final.jar: CVE-2023-4586
> {noformat}

--
This message was sent by Atlassian Jira (v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18808) netty-handler vulnerability: CVE-2023-4586
[ https://issues.apache.org/jira/browse/CASSANDRA-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766866#comment-17766866 ]

Norman Maurer commented on CASSANDRA-18808:
---

Ok, will do tomorrow at the latest.
[jira] [Commented] (CASSANDRA-18808) netty-handler vulnerability: CVE-2023-4586
[ https://issues.apache.org/jira/browse/CASSANDRA-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766864#comment-17766864 ]

Norman Maurer commented on CASSANDRA-18808:
---

I can also verify and do a PR if needed... just let me know.
[jira] [Commented] (CASSANDRA-18808) netty-handler vulnerability: CVE-2023-4586
[ https://issues.apache.org/jira/browse/CASSANDRA-18808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766849#comment-17766849 ]

Norman Maurer commented on CASSANDRA-18808:
---

Netty does not enable hostname verification by default; you need to enable it yourself. If you have already done so, there is nothing you need to do.
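For context, enabling hostname verification on the JDK {{SSLEngine}} that Netty's {{SslHandler}} wraps can be sketched with plain JDK APIs (no Netty classes needed); the host and port below are placeholders, not values from this ticket:

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLParameters;

public class HostnameVerificationSketch {
    // Build a client-mode engine with endpoint identification turned on.
    // Netty's SslHandler can then be constructed around this engine.
    public static SSLEngine clientEngine(String host, int port) throws Exception {
        SSLEngine engine = SSLContext.getDefault().createSSLEngine(host, port);
        engine.setUseClientMode(true);
        SSLParameters params = engine.getSSLParameters();
        // "HTTPS" enables RFC 2818-style hostname verification during the handshake.
        params.setEndpointIdentificationAlgorithm("HTTPS");
        engine.setSSLParameters(params);
        return engine;
    }

    public static void main(String[] args) throws Exception {
        SSLEngine e = clientEngine("peer.example.com", 9042);
        System.out.println(e.getSSLParameters().getEndpointIdentificationAlgorithm());
    }
}
```

Without the {{setEndpointIdentificationAlgorithm}} call, the handshake validates the certificate chain but never checks that the certificate matches the peer's hostname, which is the behavior CVE-2023-4586 describes.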
[jira] [Commented] (CASSANDRA-18148) netty-all vulnerability: CVE-2022-41881
[ https://issues.apache.org/jira/browse/CASSANDRA-18148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17676508#comment-17676508 ]

Norman Maurer commented on CASSANDRA-18148:
---

Your assumption is correct... Cassandra is not affected, as it does not use the decoder in question.

> netty-all vulnerability: CVE-2022-41881
> ---------------------------------------
>
>          Key: CASSANDRA-18148
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-18148
>      Project: Cassandra
>   Issue Type: Bug
>   Components: Dependencies
>     Reporter: Brandon Williams
>     Priority: Normal
>      Fix For: 3.0.x, 3.11.x, 4.0.x, 4.1.x, 4.x
>
> This is showing in the OWASP scan.
[jira] [Commented] (CASSANDRA-16277) 'SSLEngine closed already' exception on failed outbound connection
[ https://issues.apache.org/jira/browse/CASSANDRA-16277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17234313#comment-17234313 ]

Norman Maurer commented on CASSANDRA-16277:
---

LGTM as well, +1

> 'SSLEngine closed already' exception on failed outbound connection
> ------------------------------------------------------------------
>
>          Key: CASSANDRA-16277
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-16277
>      Project: Cassandra
>   Issue Type: Bug
>   Components: Messaging/Internode
>     Reporter: Aleksey Yeschenko
>     Assignee: Aleksey Yeschenko
>     Priority: Normal
>
> Occasionally Netty will invoke the
> {{OutboundConnectionInitiator#exceptionCaught()}} handler to process an
> exception of the following kind:
> {code}
> io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed:
> Connection reset by peer
> {code}
> When we invoke {{ctx.close()}} later in that method, the listener, set up in
> {{channelActive()}}, might be failed with an
> {{SSLException("SSLEngine closed already")}} by Netty, and
> {{exceptionCaught()}} will be invoked once again, this time to handle the
> {{SSLException}} triggered by {{ctx.close()}}.
> The exception at this stage is benign, and we shouldn't be double-logging the
> failure to connect.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
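The re-entrancy problem described in that ticket (closing the channel from the exception handler triggers the handler again) can be sketched outside Netty with a one-shot guard; all names here are hypothetical, this is not Cassandra's actual fix:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical handler illustrating the shape of the issue: the first
// connection failure is logged once, and the benign follow-up exception
// that the close itself triggers is swallowed instead of logged again.
class ConnectFailureLogger {
    private final AtomicBoolean reported = new AtomicBoolean();
    int logCount = 0;

    void exceptionCaught(Throwable cause) {
        if (reported.compareAndSet(false, true)) {
            logCount++;  // log the real failure exactly once
            close();     // may itself raise "SSLEngine closed already"
        }
        // re-entrant / subsequent exceptions fall through and are ignored
    }

    private void close() {
        // Closing can synchronously re-enter exceptionCaught(); the guard
        // above turns that re-entry into a no-op.
        exceptionCaught(new IllegalStateException("SSLEngine closed already"));
    }
}
```

The guard makes the handler idempotent, so however many times Netty re-invokes it during teardown, only the original failure is reported.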
[jira] [Commented] (CASSANDRA-15175) Evaluate 200 node, compression=on, encryption=all
[ https://issues.apache.org/jira/browse/CASSANDRA-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874475#comment-16874475 ]

Norman Maurer commented on CASSANDRA-15175:
---

[~jolynch] so is it fair to say that there is nothing for me to investigate on the Netty side for now?

> Evaluate 200 node, compression=on, encryption=all
> -------------------------------------------------
>
>          Key: CASSANDRA-15175
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-15175
>      Project: Cassandra
>   Issue Type: Sub-task
>   Components: Test/benchmark
>     Reporter: Joseph Lynch
>     Assignee: Joseph Lynch
>     Priority: Normal
>  Attachments: 30x_14400cRPS-14400cWPS.svg, 30x_LQ_21600cRPS-14400cWPS.svg,
> ShortbufferExceptions.png, odd_netty_jdk_tls_cpu_usage.png,
> trunk_14400cRPS-14400cWPS.svg, trunk_187000cRPS-14400cWPS.svg,
> trunk_187kcRPS_14kcWPS.png, trunk_22000cRPS-14400cWPS-jdk.svg,
> trunk_22000cRPS-14400cWPS-openssl.svg, trunk_220kcRPS_14kcWPS.png,
> trunk_252kcRPS-14kcWPS.png, trunk_93500cRPS-14400cWPS.svg,
> trunk_LQ_14400cRPS-14400cWPS.svg, trunk_LQ_21600cRPS-14400cWPS.svg,
> trunk_vs_30x_125kcRPS_14kcWPS.png, trunk_vs_30x_14kRPS_14kcWPS_load.png,
> trunk_vs_30x_14kcRPS_14kcWPS.png,
> trunk_vs_30x_14kcRPS_14kcWPS_schedstat_delays.png,
> trunk_vs_30x_156kcRPS_14kcWPS.png, trunk_vs_30x_24kcRPS_14kcWPS.png,
> trunk_vs_30x_24kcRPS_14kcWPS_load.png, trunk_vs_30x_31kcRPS_14kcWPS.png,
> trunk_vs_30x_62kcRPS_14kcWPS.png, trunk_vs_30x_93kcRPS_14kcWPS.png,
> trunk_vs_30x_LQ_14kcRPS_14kcWPS.png, trunk_vs_30x_LQ_21kcRPS_14kcWPS.png,
> trunk_vs_30x_LQ_64kcRPS_14kcWPS.png, trunk_vs_30x_LQ_jdk_summary.png,
> trunk_vs_30x_LQ_openssl_21kcRPS_14kcWPS.png,
> trunk_vs_30x_LQ_tcnative_summary.png, trunk_vs_30x_summary.png
>
> Tracks evaluating a 192 node cluster with compression and encryption on.
> Test setup at (reproduced below):
> https://docs.google.com/spreadsheets/d/1Vq_wC2q-rcG7UWim-t2leZZ4GgcuAjSREMFbG0QGy20/edit#gid=1336583053
>
> |Test Setup| |
> |Baseline|3.0.19 @d7d00036|
> |Candidate|trunk @abb0e177|
> | | |
> |Workload| |
> |Write size|4kb random|
> |Read size|4kb random|
> |Per Node Data|110GiB|
> |Generator|ndbench|
> |Key Distribution|Uniform|
> |SSTable Compr|Off|
> |Internode TLS|On (jdk)|
> |Internode Compr|On|
> |Compaction|LCS (320 MiB)|
> |Repair|Off|
> | | |
> |Hardware| |
> |Instance Type|i3.xlarge|
> |Deployment|96 us-east-1, 96 eu-west-1|
> |Region node count|96|
> | | |
> |OS Settings| |
> |IO scheduler|kyber|
> |Net qdisc|tc-fq|
> |readahead|32kb|
> |Java Version|OpenJDK 1.8.0_202 (Zulu)|
> | | |

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (CASSANDRA-15175) Evaluate 200 node, compression=on, encryption=all
[ https://issues.apache.org/jira/browse/CASSANDRA-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871778#comment-16871778 ]

Norman Maurer edited comment on CASSANDRA-15175 at 6/24/19 9:10 PM:
---

[~jolynch] Yes, please use a non-GCM cipher and report back :) And please ensure you use the same ciphers when comparing 3.x vs trunk, as otherwise there is really no way to compare them at all (from my understanding you may be using different ciphers).

was (Author: norman): [~jolynch] Yes please use a non GCM cipher and report back :)
[jira] [Commented] (CASSANDRA-15175) Evaluate 200 node, compression=on, encryption=all
[ https://issues.apache.org/jira/browse/CASSANDRA-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871778#comment-16871778 ]

Norman Maurer commented on CASSANDRA-15175:
---

[~jolynch] Yes, please use a non-GCM cipher and report back :)
[jira] [Commented] (CASSANDRA-15175) Evaluate 200 node, compression=on, encryption=all
[ https://issues.apache.org/jira/browse/CASSANDRA-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871727#comment-16871727 ]

Norman Maurer commented on CASSANDRA-15175:
---

[~jolynch] yeah, that is why I asked... From the OpenJDK code it seems like {{ShortBufferException}} should really "never happen", which is why I asked about errors.
[jira] [Commented] (CASSANDRA-15175) Evaluate 200 node, compression=on, encryption=all
[ https://issues.apache.org/jira/browse/CASSANDRA-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16871684#comment-16871684 ]

Norman Maurer commented on CASSANDRA-15175:
---

[~jolynch] one question... when using JDK TLS, do you see any errors at all, or do you just see more CPU usage and that's it?
[jira] [Commented] (CASSANDRA-15175) Evaluate 200 node, compression=on, encryption=all
[ https://issues.apache.org/jira/browse/CASSANDRA-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870444#comment-16870444 ]

Norman Maurer commented on CASSANDRA-15175:
---

Also, please include the cipher that is used.
[jira] [Commented] (CASSANDRA-15175) Evaluate 200 node, compression=on, encryption=all
[ https://issues.apache.org/jira/browse/CASSANDRA-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870256#comment-16870256 ]

Norman Maurer commented on CASSANDRA-15175:
---

Can you provide the full JDK version as well?
[jira] [Commented] (CASSANDRA-15175) Evaluate 200 node, compression=on, encryption=all
[ https://issues.apache.org/jira/browse/CASSANDRA-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870255#comment-16870255 ]

Norman Maurer commented on CASSANDRA-15175:
---

I will have a look.
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16587629#comment-16587629 ]

Norman Maurer commented on CASSANDRA-13651:
---

I am not a committer, but as the Netty project lead, from the point of view of Netty usage I am also +1 on this.

> Large amount of CPU used by epoll_wait(.., .., .., 0)
> -----------------------------------------------------
>
>          Key: CASSANDRA-13651
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-13651
>      Project: Cassandra
>   Issue Type: Bug
>     Reporter: Corentin Chary
>     Assignee: Corentin Chary
>     Priority: Major
>      Fix For: 4.x
>
>  Attachments: cpu-usage.png
>
> I was trying to profile Cassandra under my workload and I kept seeing this
> backtrace:
> {code}
> epollEventLoopGroup-2-3 State: RUNNABLE CPU usage on sample: 240ms
> io.netty.channel.epoll.Native.epollWait0(int, long, int, int) Native.java (native)
> io.netty.channel.epoll.Native.epollWait(int, EpollEventArray, int) Native.java:111
> io.netty.channel.epoll.EpollEventLoop.epollWait(boolean) EpollEventLoop.java:230
> io.netty.channel.epoll.EpollEventLoop.run() EpollEventLoop.java:254
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run() SingleThreadEventExecutor.java:858
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run() DefaultThreadFactory.java:138
> java.lang.Thread.run() Thread.java:745
> {code}
> At first I thought that the profiler might not be able to profile native code
> properly, but I went further and realized that most of the CPU was used by
> {{epoll_wait()}} calls with a timeout of zero.
> Here is the output of perf on this system, which confirms that most of the
> overhead was with timeout == 0.
> {code}
> Samples: 11M of event 'syscalls:sys_enter_epoll_wait', Event count (approx.): 11594448
> Overhead  Trace output
>   90.06%  epfd: 0x0047, events: 0x7f5588c0c000, maxevents: 0x2000, timeout: 0x
>    5.77%  epfd: 0x00b5, events: 0x7fca419ef000, maxevents: 0x1000, timeout: 0x
>    1.98%  epfd: 0x00b5, events: 0x7fca419ef000, maxevents: 0x1000, timeout: 0x03e8
>    0.04%  epfd: 0x0003, events: 0x2f6af77b9c00, maxevents: 0x0020, timeout: 0x
>    0.04%  epfd: 0x002b, events: 0x121ebf63ac00, maxevents: 0x0040, timeout: 0x
>    0.03%  epfd: 0x0026, events: 0x7f51f80019c0, maxevents: 0x0020, timeout: 0x
>    0.02%  epfd: 0x0003, events: 0x7fe4d80019d0, maxevents: 0x0020, timeout: 0x
> {code}
> Running this time with perf record -ag for call traces:
> {code}
> # Children   Self    sys    usr  Trace output
>     8.61%   8.61%  0.00%  8.61%  epfd: 0x00a7, events: 0x7fca452d6000, maxevents: 0x1000, timeout: 0x
>          |
>          ---0x1000200af313
>             |
>             --8.61%--0x7fca6117bdac
>                      0x7fca60459804
>                      epoll_wait
>     2.98%   2.98%  0.00%  2.98%  epfd: 0x00a7, events: 0x7fca452d6000, maxevents: 0x1000, timeout: 0x03e8
>          |
>          ---0x1000200af313
>             0x7fca6117b830
>             0x7fca60459804
>             epoll_wait
> {code}
> That looks like a lot of CPU used to wait for nothing. I'm not sure if perf
> reports a per-CPU percentage or a per-system percentage, but that would
> still be 10% of the total CPU usage of Cassandra at a minimum.
> I went further and found the code responsible for all that: we schedule a lot
> of {{Message::Flusher}} with a deadline of 10 usec (5 per message I think),
> but netty+epoll only
[jira] [Commented] (CASSANDRA-13929) BTree$Builder / io.netty.util.Recycler$Stack leaking memory
[ https://issues.apache.org/jira/browse/CASSANDRA-13929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16374138#comment-16374138 ]

Norman Maurer commented on CASSANDRA-13929:
---

Just a general comment: the Recycler only makes sense to use if creating the object is considered very expensive and/or if you create and destroy a lot of these very frequently, which usually means thousands per second. So if this is not the case here, I think it is completely reasonable not to use the Recycler at all... As I have no real idea about the use-case, I will just leave this here as a general comment :)

> BTree$Builder / io.netty.util.Recycler$Stack leaking memory
> -----------------------------------------------------------
>
>          Key: CASSANDRA-13929
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-13929
>      Project: Cassandra
>   Issue Type: Bug
>   Components: Core
>     Reporter: Thomas Steinmaurer
>     Assignee: Jay Zhuang
>     Priority: Major
>      Fix For: 3.11.x
>
>  Attachments: cassandra_3.11.0_min_memory_utilization.jpg,
> cassandra_3.11.1_NORECYCLE_memory_utilization.jpg,
> cassandra_3.11.1_mat_dominator_classes.png,
> cassandra_3.11.1_mat_dominator_classes_FIXED.png,
> cassandra_3.11.1_snapshot_heaputilization.png,
> cassandra_3.11.1_vs_3.11.2recyclernullingpatch.png,
> dtest_example_80_request.png, dtest_example_80_request_fix.png,
> dtest_example_heap.png, memleak_heapdump_recyclerstack.png
>
> Different to CASSANDRA-13754, there seems to be another memory leak in
> 3.11.0+ in BTree$Builder / io.netty.util.Recycler$Stack.
> * heap utilization increase after upgrading to 3.11.0 =>
>   cassandra_3.11.0_min_memory_utilization.jpg
> * No difference after upgrading to 3.11.1 (snapshot build) =>
>   cassandra_3.11.1_snapshot_heaputilization.png; thus most likely after fixing
>   CASSANDRA-13754, more visible now
> * MAT shows io.netty.util.Recycler$Stack as the top contributing class =>
>   cassandra_3.11.1_mat_dominator_classes.png
> * With -Xmx8G (CMS) and our load pattern, we have to do a rolling restart
>   after ~72 hours
>
> Verified the following fix, namely explicitly unreferencing the
> _recycleHandle_ member (making it non-final) in
> _org.apache.cassandra.utils.btree.BTree.Builder.recycle()_:
> {code}
> public void recycle()
> {
>     if (recycleHandle != null)
>     {
>         this.cleanup();
>         builderRecycler.recycle(this, recycleHandle);
>         recycleHandle = null; // ADDED
>     }
> }
> {code}
> Patched a single node in our loadtest cluster with this change, and after ~10
> hours of uptime there is no sign of the previously offending class in MAT
> anymore => cassandra_3.11.1_mat_dominator_classes_FIXED.png
> Can't say if this has any other side effects etc., but I doubt it.
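The pooling advice above (recycle only cheap-to-miss, high-churn objects, and cap how much a pool may retain) can be illustrated with a tiny standalone pool. This is a hypothetical sketch in plain Java, not Netty's actual Recycler API:

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Illustrative bounded object pool: once the capacity limit is reached,
// recycled objects are simply dropped, so the pool can never retain an
// unbounded amount of memory the way an uncapped recycler stack can.
class BoundedPool<T> {
    private final ArrayDeque<T> stack = new ArrayDeque<>();
    private final Supplier<T> factory;
    private final int maxCapacity;

    BoundedPool(Supplier<T> factory, int maxCapacity) {
        this.factory = factory;
        this.maxCapacity = maxCapacity;
    }

    T get() {
        T obj = stack.poll();           // reuse a pooled instance if available
        return obj != null ? obj : factory.get();
    }

    void recycle(T obj) {
        if (stack.size() < maxCapacity)
            stack.push(obj);            // keep for later reuse
        // else: drop it and let GC reclaim the memory
    }

    int pooled() { return stack.size(); }
}
```

With the capacity capped, a burst of allocations can still grow the heap temporarily, but the pool itself stops being a GC root for an ever-growing stack of retired builders, which is the failure mode this ticket describes.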
[jira] [Commented] (CASSANDRA-13929) BTree$Builder / io.netty.util.Recycler$Stack leaking memory
[ https://issues.apache.org/jira/browse/CASSANDRA-13929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363974#comment-16363974 ]

Norman Maurer commented on CASSANDRA-13929:
-------------------------------------------

[~tsteinmaurer] yeah, thanks, I see... So this looks more like a misuse to me than a netty bug. Cassandra may also consider configuring the `Recycler` with a saner default value for this use-case (via the constructor).

> BTree$Builder / io.netty.util.Recycler$Stack leaking memory
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-13929
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13929
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Thomas Steinmaurer
>            Assignee: Jay Zhuang
>            Priority: Major
>             Fix For: 3.11.x
>
>         Attachments: cassandra_3.11.0_min_memory_utilization.jpg, cassandra_3.11.1_NORECYCLE_memory_utilization.jpg, cassandra_3.11.1_mat_dominator_classes.png, cassandra_3.11.1_mat_dominator_classes_FIXED.png, cassandra_3.11.1_snapshot_heaputilization.png, cassandra_3.11.1_vs_3.11.2recyclernullingpatch.png, dtest_example_80_request.png, dtest_example_80_request_fix.png, dtest_example_heap.png, memleak_heapdump_recyclerstack.png
>
> Different to CASSANDRA-13754, there seems to be another memory leak in 3.11.0+ in BTree$Builder / io.netty.util.Recycler$Stack.
> * Heap utilization increase after upgrading to 3.11.0 => cassandra_3.11.0_min_memory_utilization.jpg
> * No difference after upgrading to 3.11.1 (snapshot build) => cassandra_3.11.1_snapshot_heaputilization.png; thus most likely more visible now after fixing CASSANDRA-13754
> * MAT shows io.netty.util.Recycler$Stack as the top contributing class => cassandra_3.11.1_mat_dominator_classes.png
> * With -Xmx8G (CMS) and our load pattern, we have to do a rolling restart after ~ 72 hours
> Verified the following fix, namely explicitly unreferencing the _recycleHandle_ member (making it non-final). In _org.apache.cassandra.utils.btree.BTree.Builder.recycle()_:
> {code}
> public void recycle()
> {
>     if (recycleHandle != null)
>     {
>         this.cleanup();
>         builderRecycler.recycle(this, recycleHandle);
>         recycleHandle = null; // ADDED
>     }
> }
> {code}
> Patched a single node in our loadtest cluster with this change and after ~ 10 hours uptime, there is no sign of the previously offending class in MAT anymore => cassandra_3.11.1_mat_dominator_classes_FIXED.png
> Can't say if this has any other side effects etc., but I doubt it.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
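The two suggestions in this thread, bounding the recycler's capacity via the constructor and dropping the pooled object's back-reference once it is returned, can be illustrated with a minimal stand-in sketch of a bounded per-thread pool. This is not netty's `Recycler` or Cassandra's actual code; the class and method names are illustrative only.

```java
import java.util.ArrayDeque;

// Hedged stand-in for netty's Recycler: a bounded per-thread object pool.
// (1) The constructor caps how many objects each thread may retain, so the
//     pool cannot grow without bound under bursty allocation.
// (2) offer() refuses objects over capacity, letting GC reclaim them, which
//     mirrors the effect of nulling the recycle handle in the patch above.
final class BoundedPool<T> {
    private final int maxCapacity;
    private final ThreadLocal<ArrayDeque<T>> stack =
            ThreadLocal.withInitial(ArrayDeque::new);

    BoundedPool(int maxCapacity) {
        this.maxCapacity = maxCapacity;
    }

    /** Returns a pooled object, or null if the caller must allocate fresh. */
    T poll() {
        return stack.get().poll();
    }

    /** Returns the object to the pool; false means it was dropped for GC. */
    boolean offer(T obj) {
        ArrayDeque<T> s = stack.get();
        if (s.size() >= maxCapacity)
            return false; // over capacity: do not retain a reference
        s.push(obj);
        return true;
    }
}
```

The real `Recycler` is per-thread and lock-free in the same spirit; the point of the sketch is only that capacity is a constructor-time decision.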
[jira] [Commented] (CASSANDRA-13929) BTree$Builder / io.netty.util.Recycler$Stack leaking memory
[ https://issues.apache.org/jira/browse/CASSANDRA-13929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16363597#comment-16363597 ]

Norman Maurer commented on CASSANDRA-13929:
-------------------------------------------

[~jay.zhuang] I would be interested in what is contained in the `Stack` itself.
[jira] [Commented] (CASSANDRA-13929) BTree$Builder / io.netty.util.Recycler$Stack leaking memory
[ https://issues.apache.org/jira/browse/CASSANDRA-13929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348723#comment-16348723 ]

Norman Maurer commented on CASSANDRA-13929:
-------------------------------------------

[~tsteinmaurer] what about a heap dump? Is this something you could provide?
[jira] [Commented] (CASSANDRA-13929) BTree$Builder / io.netty.util.Recycler$Stack leaking memory
[ https://issues.apache.org/jira/browse/CASSANDRA-13929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348181#comment-16348181 ]

Norman Maurer commented on CASSANDRA-13929:
-------------------------------------------

yeah, 4.0.55.Final should have the "fix" as well: [https://github.com/netty/netty/commit/b386ee3eaf35abd5072992d626de6ae2ccadc6d9#diff-23eafd00fcd66829f8cce343b26c236a]

That said, maybe there are other issues. Would it be possible to share a heap dump?
[jira] [Commented] (CASSANDRA-13789) Reduce memory copies and object creations when acting on ByteBufs
[ https://issues.apache.org/jira/browse/CASSANDRA-13789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16183709#comment-16183709 ]

Norman Maurer commented on CASSANDRA-13789:
-------------------------------------------

[~jasobrown] [~Stefania] my expectation was that callers of these methods ensure there will be no "concurrent" access to the underlying object, and duplicate it if necessary. If this is not true, then yes, this needs to be changed. Just let me know and I will take care of it (happy to write a patch).

> Reduce memory copies and object creations when acting on ByteBufs
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-13789
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13789
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Norman Maurer
>            Assignee: Norman Maurer
>             Fix For: 4.0
>
>         Attachments: 0001-CBUtil.sizeOfLongString-encodes-String-to-byte-to-ca.patch, 0001-Reduce-memory-copies-and-object-creations-when-actin.patch
>
> There are multiple "low-hanging-fruits" when it comes to reducing memory copies and object allocations when acting on ByteBufs.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13789) Reduce memory copies and object creations when acting on ByteBufs
[ https://issues.apache.org/jira/browse/CASSANDRA-13789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16157554#comment-16157554 ]

Norman Maurer commented on CASSANDRA-13789:
-------------------------------------------

[~jasobrown] Doh! I guess I need to buy the beer now.
[jira] [Updated] (CASSANDRA-13789) Reduce memory copies and object creations when acting on ByteBufs
[ https://issues.apache.org/jira/browse/CASSANDRA-13789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Norman Maurer updated CASSANDRA-13789:
--------------------------------------

    Attachment: 0001-CBUtil.sizeOfLongString-encodes-String-to-byte-to-ca.patch

And another one...
[jira] [Commented] (CASSANDRA-13789) Reduce memory copies and object creations when acting on ByteBufs
[ https://issues.apache.org/jira/browse/CASSANDRA-13789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139768#comment-16139768 ]

Norman Maurer commented on CASSANDRA-13789:
-------------------------------------------

[~jasobrown] let me know if you need anything else.
[jira] [Updated] (CASSANDRA-13789) Reduce memory copies and object creations when acting on ByteBufs
[ https://issues.apache.org/jira/browse/CASSANDRA-13789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Norman Maurer updated CASSANDRA-13789:
--------------------------------------

    Assignee: Norman Maurer
      Status: Patch Available  (was: Open)
[jira] [Created] (CASSANDRA-13789) Reduce memory copies and object creations when acting on ByteBufs
Norman Maurer created CASSANDRA-13789:
-------------------------------------

             Summary: Reduce memory copies and object creations when acting on ByteBufs
                 Key: CASSANDRA-13789
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13789
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Norman Maurer
         Attachments: 0001-Reduce-memory-copies-and-object-creations-when-actin.patch

There are multiple "low-hanging-fruits" when it comes to reducing memory copies and object allocations when acting on ByteBufs.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13649) Uncaught exceptions in Netty pipeline
[ https://issues.apache.org/jira/browse/CASSANDRA-13649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16134918#comment-16134918 ]

Norman Maurer commented on CASSANDRA-13649:
-------------------------------------------

It's up to you guys which versions the patch should be applied to; I just wanted to mention that this fix is very low-risk in terms of Netty itself, as it just moves logic to an extra ChannelHandler. That's all.

> Uncaught exceptions in Netty pipeline
> -------------------------------------
>
>                 Key: CASSANDRA-13649
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13649
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging, Testing
>            Reporter: Stefan Podkowinski
>            Assignee: Norman Maurer
>              Labels: patch
>         Attachments: 0001-CASSANDRA-13649-Ensure-all-exceptions-are-correctly-.patch, test_stdout.txt
>
> I've noticed some netty related errors in trunk in [some of the dtest results|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/106/#showFailuresLink]. Just want to make sure that we don't have to change anything related to the exception handling in our pipeline and that this isn't a netty issue. Actually if this causes flakiness but is otherwise harmless, we should do something about it, even if it's just on the dtest side.
> {noformat}
> WARN  [epollEventLoopGroup-2-9] 2017-06-28 17:23:49,699 Slf4JLogger.java:151 - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
> io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
>         at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown Source) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> {noformat}
> And again in another test:
> {noformat}
> WARN  [epollEventLoopGroup-2-8] 2017-06-29 02:27:31,300 Slf4JLogger.java:151 - An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
> io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: Connection reset by peer
>         at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown Source) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
> {noformat}
> Edit:
> The {{io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed}} error also causes tests to fail for 3.0 and 3.11.
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16133438#comment-16133438 ]

Norman Maurer commented on CASSANDRA-13651:
-------------------------------------------

[~iksaif] FYI we will merge the change to use timerfd today, so it will be part of the next Netty release. That said, I think what you suggested (changing the Cassandra code) may make more sense in general.

> Large amount of CPU used by epoll_wait(.., .., .., 0)
> -----------------------------------------------------
>
>                 Key: CASSANDRA-13651
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13651
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Corentin Chary
>            Assignee: Corentin Chary
>             Fix For: 4.x
>
>         Attachments: cpu-usage.png
>
> I was trying to profile Cassandra under my workload and I kept seeing this backtrace:
> {code}
> epollEventLoopGroup-2-3 State: RUNNABLE CPU usage on sample: 240ms
> io.netty.channel.epoll.Native.epollWait0(int, long, int, int) Native.java (native)
> io.netty.channel.epoll.Native.epollWait(int, EpollEventArray, int) Native.java:111
> io.netty.channel.epoll.EpollEventLoop.epollWait(boolean) EpollEventLoop.java:230
> io.netty.channel.epoll.EpollEventLoop.run() EpollEventLoop.java:254
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run() SingleThreadEventExecutor.java:858
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run() DefaultThreadFactory.java:138
> java.lang.Thread.run() Thread.java:745
> {code}
> At first I thought that the profiler might not be able to profile native code properly, but I went further and I realized that most of the CPU was used by {{epoll_wait()}} calls with a timeout of zero.
> Here is the output of perf on this system, which confirms that most of the overhead was with timeout == 0.
> {code}
> Samples: 11M of event 'syscalls:sys_enter_epoll_wait', Event count (approx.): 11594448
> Overhead  Trace output
>   90.06%  epfd: 0x0047, events: 0x7f5588c0c000, maxevents: 0x2000, timeout: 0x
>    5.77%  epfd: 0x00b5, events: 0x7fca419ef000, maxevents: 0x1000, timeout: 0x
>    1.98%  epfd: 0x00b5, events: 0x7fca419ef000, maxevents: 0x1000, timeout: 0x03e8
>    0.04%  epfd: 0x0003, events: 0x2f6af77b9c00, maxevents: 0x0020, timeout: 0x
>    0.04%  epfd: 0x002b, events: 0x121ebf63ac00, maxevents: 0x0040, timeout: 0x
>    0.03%  epfd: 0x0026, events: 0x7f51f80019c0, maxevents: 0x0020, timeout: 0x
>    0.02%  epfd: 0x0003, events: 0x7fe4d80019d0, maxevents: 0x0020, timeout: 0x
> {code}
> Running this time with perf record -ag for call traces:
> {code}
> # Children      Self   sys   usr  Trace output
>      8.61%     8.61%  0.00%  8.61%  epfd: 0x00a7, events: 0x7fca452d6000, maxevents: 0x1000, timeout: 0x
>             |
>             ---0x1000200af313
>                |
>                --8.61%--0x7fca6117bdac
>                          0x7fca60459804
>                          epoll_wait
>      2.98%     2.98%  0.00%  2.98%  epfd: 0x00a7, events: 0x7fca452d6000, maxevents: 0x1000, timeout: 0x03e8
>             |
>             ---0x1000200af313
>                0x7fca6117b830
>                0x7fca60459804
>                epoll_wait
> {code}
> That looks like a lot of CPU used to wait for nothing. I'm not sure if perf reports a per-CPU percentage or a per-system percentage, but that would still be 10% of the total CPU usage of Cassandra at the minimum.
> I went further and found the code of all that: We schedule a lot of {{Message::Flusher}} with
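The cost pattern described above can be reproduced with the plain JDK `Selector` as a rough stand-in for netty's epoll loop: calling it with a zero timeout returns immediately and busy-spins, while a blocking wait parks the thread in the kernel until the timeout. A minimal sketch, illustrative only, not the actual Cassandra or netty hot path:

```java
import java.nio.channels.Selector;

// Contrast a zero-timeout poll loop (like epoll_wait(..., 0)) with a
// blocking wait (like epoll_wait with a real timeout). No channels are
// registered, so all the work is pure polling overhead.
public final class PollVsWait {
    /** Count how many zero-timeout polls fit into a wall-clock window. */
    static long spinCount(long windowMillis) throws Exception {
        try (Selector selector = Selector.open()) {
            long spins = 0;
            long deadline = System.nanoTime() + windowMillis * 1_000_000L;
            while (System.nanoTime() < deadline) {
                selector.selectNow(); // returns immediately, burns CPU
                spins++;
            }
            return spins;
        }
    }

    /** Measure how long a single blocking select with a timeout parks. */
    static long blockedMillis(long timeoutMillis) throws Exception {
        try (Selector selector = Selector.open()) {
            long t0 = System.nanoTime();
            selector.select(timeoutMillis); // parks in the kernel
            return (System.nanoTime() - t0) / 1_000_000L;
        }
    }
}
```

The zero-timeout loop typically executes thousands of syscalls in a few milliseconds, which is exactly the 90% `timeout: 0x…` bucket in the perf output above; the blocking form wakes at most once per timeout period.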
[jira] [Updated] (CASSANDRA-13649) Uncaught exceptions in Netty pipeline
[ https://issues.apache.org/jira/browse/CASSANDRA-13649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Norman Maurer updated CASSANDRA-13649:
--------------------------------------

    Attachment: 0001-CASSANDRA-13649-Ensure-all-exceptions-are-correctly-.patch

This patch should fix the problem and should go into all active Cassandra branches.
[jira] [Updated] (CASSANDRA-13649) Uncaught exceptions in Netty pipeline
[ https://issues.apache.org/jira/browse/CASSANDRA-13649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Norman Maurer updated CASSANDRA-13649:
--------------------------------------

      Labels: patch  (was: )
    Assignee: Norman Maurer
      Status: Patch Available  (was: Open)

To ensure we handle all exceptions in the netty pipeline that are produced by either the Channel itself or the ChannelHandlers in the pipeline, we need to make sure the handlers that do so do not use the RequestThreadPoolExecutor, as it does not enforce strict per-Channel ordering of events.
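The ordering property discussed above can be sketched as follows: a small wrapper that serializes one channel's events on top of an arbitrary multi-threaded executor, which is the guarantee the comment says the RequestThreadPoolExecutor lacked. The class name is hypothetical; this is not Cassandra's actual implementation.

```java
import java.util.ArrayDeque;
import java.util.concurrent.Executor;

// One instance per channel: events submitted to this wrapper run in
// submission order even when the delegate executor has many threads,
// because at most one drain loop is in flight at a time.
final class SerialExecutor implements Executor {
    private final Executor delegate;
    private final ArrayDeque<Runnable> queue = new ArrayDeque<>();
    private boolean running;

    SerialExecutor(Executor delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized void execute(Runnable task) {
        queue.add(task);
        if (!running) {
            running = true;
            delegate.execute(this::drain);
        }
    }

    private void drain() {
        for (;;) {
            Runnable next;
            synchronized (this) {
                next = queue.poll();
                if (next == null) {
                    running = false;
                    return;
                }
            }
            next.run(); // run outside the lock, strictly in FIFO order
        }
    }
}
```

Without such per-channel serialization, an exceptionCaught() event dispatched on a shared pool can overtake the read event that caused it, which is one way an exception ends up unhandled at the tail of the pipeline.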
[jira] [Commented] (CASSANDRA-13649) Uncaught exceptions in Netty pipeline
[ https://issues.apache.org/jira/browse/CASSANDRA-13649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131914#comment-16131914 ]

Norman Maurer commented on CASSANDRA-13649:
-------------------------------------------

Actually I think it is because of how you do the event dispatching in Cassandra... working on a patch, stay tuned.
[jira] [Commented] (CASSANDRA-13649) Uncaught exceptions in Netty pipeline
[ https://issues.apache.org/jira/browse/CASSANDRA-13649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125612#comment-16125612 ]

Norman Maurer commented on CASSANDRA-13649:
-------------------------------------------

And this only happens with the native epoll transport, but not with the nio transport?
[jira] [Commented] (CASSANDRA-13649) Uncaught exceptions in Netty pipeline
[ https://issues.apache.org/jira/browse/CASSANDRA-13649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119785#comment-16119785 ] Norman Maurer commented on CASSANDRA-13649: --- Is it possible that you have an eventExecutor set here?: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/transport/Server.java#L329 And if so can you show me the implementation of it ? > Uncaught exceptions in Netty pipeline > - > > Key: CASSANDRA-13649 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13649 > Project: Cassandra > Issue Type: Bug > Components: Streaming and Messaging, Testing >Reporter: Stefan Podkowinski > Attachments: test_stdout.txt > > > I've noticed some netty related errors in trunk in [some of the dtest > results|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/106/#showFailuresLink]. > Just want to make sure that we don't have to change anything related to the > exception handling in our pipeline and that this isn't a netty issue. > Actually if this causes flakiness but is otherwise harmless, we should do > something about it, even if it's just on the dtest side. > {noformat} > WARN [epollEventLoopGroup-2-9] 2017-06-28 17:23:49,699 Slf4JLogger.java:151 > - An exceptionCaught() event was fired, and it reached at the tail of the > pipeline. It usually means the last handler in the pipeline did not handle > the exception. > io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: > Connection reset by peer > at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown > Source) ~[netty-all-4.0.44.Final.jar:4.0.44.Final] > {noformat} > And again in another test: > {noformat} > WARN [epollEventLoopGroup-2-8] 2017-06-29 02:27:31,300 Slf4JLogger.java:151 > - An exceptionCaught() event was fired, and it reached at the tail of the > pipeline. It usually means the last handler in the pipeline did not handle > the exception. 
> io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: > Connection reset by peer > at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown > Source) ~[netty-all-4.0.44.Final.jar:4.0.44.Final] > {noformat} > Edit: > The {{io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() > failed}} error also causes tests to fail for 3.0 and 3.11. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
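The "reached at the tail of the pipeline" warning above is Netty's default reaction when no handler in the pipeline consumes an exception. A minimal sketch of the kind of classification a last-resort handler could apply is below; this is hypothetical illustration code, not Cassandra's actual pipeline logic. It relies on the fact that the reported NativeIoException is an IOException, and simply distinguishes ordinary remote disconnects (which could be logged at DEBUG) from everything else:

```java
import java.io.IOException;

public class TailExceptionSketch {
    // Hypothetical classifier: "Connection reset by peer" and "Broken pipe"
    // are ordinary remote disconnects, not server-side bugs. A last-resort
    // handler at the pipeline tail could downgrade these from WARN to DEBUG.
    static boolean isBenignDisconnect(Throwable t) {
        if (!(t instanceof IOException)) {
            return false;
        }
        String msg = t.getMessage();
        return msg != null && (msg.contains("Connection reset by peer")
                || msg.contains("Broken pipe"));
    }

    public static void main(String[] args) {
        Throwable reset = new IOException(
                "syscall:read(...)() failed: Connection reset by peer");
        System.out.println(isBenignDisconnect(reset));                      // true
        System.out.println(isBenignDisconnect(new IOException("disk full"))); // false
    }
}
```

If the errors are harmless as the reporter suspects, a handler like this placed last in the pipeline would stop the noise without hiding real failures.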
[jira] [Commented] (CASSANDRA-13649) Uncaught exceptions in Netty pipeline
[ https://issues.apache.org/jira/browse/CASSANDRA-13649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119753#comment-16119753 ] Norman Maurer commented on CASSANDRA-13649: --- [~spo...@gmail.com] yes, how handler(...) and childHandler(...) are now handled is more consistent. Can you give me a link to the code where you set up the handlers that produce these exceptions?
[jira] [Commented] (CASSANDRA-13649) Uncaught exceptions in Netty pipeline
[ https://issues.apache.org/jira/browse/CASSANDRA-13649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114698#comment-16114698 ] Norman Maurer commented on CASSANDRA-13649: --- [~spo...@gmail.com] sorry, I am a bit busy at the moment, but I will check over the next few days and come back to you.
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16111027#comment-16111027 ] Norman Maurer commented on CASSANDRA-13651: --- Also for the record, Scott Mitchell (another core netty dev) just created a PR in netty to support microsecond timeouts when using the native epoll transport: https://github.com/netty/netty/pull/7042 That said, I still need to review it in detail, and it's not merged yet. > Large amount of CPU used by epoll_wait(.., .., .., 0) > - > > Key: CASSANDRA-13651 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13651 > Project: Cassandra > Issue Type: Bug >Reporter: Corentin Chary > Fix For: 4.x > > Attachments: cpu-usage.png > > > I was trying to profile Cassandra under my workload and I kept seeing this > backtrace: > {code} > epollEventLoopGroup-2-3 State: RUNNABLE CPU usage on sample: 240ms > io.netty.channel.epoll.Native.epollWait0(int, long, int, int) Native.java > (native) > io.netty.channel.epoll.Native.epollWait(int, EpollEventArray, int) > Native.java:111 > io.netty.channel.epoll.EpollEventLoop.epollWait(boolean) > EpollEventLoop.java:230 > io.netty.channel.epoll.EpollEventLoop.run() EpollEventLoop.java:254 > io.netty.util.concurrent.SingleThreadEventExecutor$5.run() > SingleThreadEventExecutor.java:858 > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run() > DefaultThreadFactory.java:138 > java.lang.Thread.run() Thread.java:745 > {code} > At first I thought that the profiler might not be able to profile native code > properly, but I went further and I realized that most of the CPU was used by > {{epoll_wait()}} calls with a timeout of zero. > Here is the output of perf on this system, which confirms that most of the > overhead was with timeout == 0. 
> {code} > Samples: 11M of event 'syscalls:sys_enter_epoll_wait', Event count (approx.): > 11594448 > Overhead Trace output > > ◆ > 90.06% epfd: 0x0047, events: 0x7f5588c0c000, maxevents: 0x2000, > timeout: 0x > ▒ >5.77% epfd: 0x00b5, events: 0x7fca419ef000, maxevents: 0x1000, > timeout: 0x > ▒ >1.98% epfd: 0x00b5, events: 0x7fca419ef000, maxevents: 0x1000, > timeout: 0x03e8 > ▒ >0.04% epfd: 0x0003, events: 0x2f6af77b9c00, maxevents: 0x0020, > timeout: 0x > ▒ >0.04% epfd: 0x002b, events: 0x121ebf63ac00, maxevents: 0x0040, > timeout: 0x > ▒ >0.03% epfd: 0x0026, events: 0x7f51f80019c0, maxevents: 0x0020, > timeout: 0x > ▒ >0.02% epfd: 0x0003, events: 0x7fe4d80019d0, maxevents: 0x0020, > timeout: 0x > {code} > Running this time with perf record -ag for call traces: > {code} > # Children Self sys usr Trace output > > # > > # > 8.61% 8.61% 0.00% 8.61% epfd: 0x00a7, events: > 0x7fca452d6000, maxevents: 0x1000, timeout: 0x > | > ---0x1000200af313 >| > --8.61%--0x7fca6117bdac > 0x7fca60459804 > epoll_wait > 2.98% 2.98% 0.00% 2.98% epfd: 0x00a7, events: > 0x7fca452d6000, maxevents: 0x1000, timeout: 0x03e8 > | > ---0x1000200af313 >0x7fca6117b830 >0x7fca60459804 >epoll_wait > {code} > That looks like a lot of CPU used to wait for nothing. I'm not sure if pref > reports a per-CPU percentage or a per-system percentage, but that would be > still be 10% of the total CPU usage of Cassandra at the minimum. > I went further and found the code of all that: We schedule a lot of >
[jira] [Commented] (CASSANDRA-13651) Large amount of CPU used by epoll_wait(.., .., .., 0)
[ https://issues.apache.org/jira/browse/CASSANDRA-13651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110335#comment-16110335 ] Norman Maurer commented on CASSANDRA-13651: --- Sorry for the late response... I didn't see the mention :( So yes, netty uses `epoll_wait`, which only supports millisecond resolution, so anything smaller than this will just cause `epoll_wait(...)` to be called with a `0` timeout, i.e. a non-blocking check of ready fds. What we could do in our native transport implementation is make use of `timerfd` [1] to schedule timeouts, but again this would only work when using the native epoll transport and not when you use the nio transport (which works on all OSes). So I think what you really want to do is have timeouts of >= 1ms. Comments and ideas welcome :)
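The advice above boils down to: with millisecond resolution, any positive sub-millisecond delay must be rounded up, never truncated to 0, or the event loop degenerates into the busy non-blocking polling visible in the perf output. A small sketch of that conversion follows; `toEpollWaitMillis` is a hypothetical helper, not Netty's actual code:

```java
public class EpollTimeoutSketch {
    // Hypothetical helper: convert a nanosecond delay into the millisecond
    // timeout argument for epoll_wait. Round up, so a positive delay is never
    // flattened to 0 -- epoll_wait(..., 0) returns immediately and turns
    // short timeouts into a CPU-burning busy loop.
    static int toEpollWaitMillis(long delayNanos) {
        if (delayNanos <= 0) {
            return 0; // already due: a non-blocking poll is the right call
        }
        // Ceiling division without overflow: add 1 ms unless already exact.
        long millis = delayNanos / 1_000_000 + (delayNanos % 1_000_000 == 0 ? 0 : 1);
        return (int) Math.min(millis, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(toEpollWaitMillis(250_000));   // 0.25 ms -> 1
        System.out.println(toEpollWaitMillis(1_500_000)); // 1.5 ms  -> 2
        System.out.println(toEpollWaitMillis(0));         // 0
    }
}
```

This matches the ">= 1ms" recommendation: callers scheduling sub-millisecond timeouts pay one extra millisecond of latency instead of spinning.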
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943172#comment-15943172 ] Norman Maurer commented on CASSANDRA-8457: -- [~jasobrown] let me know if I should review the code again. > nio MessagingService > > > Key: CASSANDRA-8457 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8457 > Project: Cassandra > Issue Type: New Feature >Reporter: Jonathan Ellis >Assignee: Jason Brown >Priority: Minor > Labels: netty, performance > Fix For: 4.x > > > Thread-per-peer (actually two each incoming and outbound) is a big > contributor to context switching, especially for larger clusters. Let's look > at switching to nio, possibly via Netty. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863760#comment-15863760 ] Norman Maurer commented on CASSANDRA-8457: -- [~jasobrown] whooot :) I will do another review as well, just to double-check again.
[jira] [Commented] (CASSANDRA-13114) 3.0.x: update netty
[ https://issues.apache.org/jira/browse/CASSANDRA-13114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15838465#comment-15838465 ] Norman Maurer commented on CASSANDRA-13114: --- [~spo...@gmail.com] nope, just upgrade to 4.0.43. FTW ;) That said, we will release 4.0.44.Final at the end of this week, or next week at the latest. > 3.0.x: update netty > --- > > Key: CASSANDRA-13114 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13114 > Project: Cassandra > Issue Type: Bug >Reporter: Tom van der Woerdt > Attachments: 13114_netty-4.0.43_2.x-3.0.patch, > 13114_netty-4.0.43_3.11.patch > > > https://issues.apache.org/jira/browse/CASSANDRA-12032 updated netty for > Cassandra 3.8, but this wasn't backported. Netty 4.0.23, which ships with > Cassandra 3.0.x, has some serious bugs around memory handling for SSL > connections. > It would be nice if both were updated to 4.0.42, a version released this year. > 4.0.23 makes it impossible for me to run SSL, because nodes run out of memory > every ~30 minutes. This was fixed in 4.0.27. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10993) Make read and write requests paths fully non-blocking, eliminate related stages
[ https://issues.apache.org/jira/browse/CASSANDRA-10993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15370184#comment-15370184 ] Norman Maurer commented on CASSANDRA-10993: --- [~thobbs] so, all good? If not, just ping me... > Make read and write requests paths fully non-blocking, eliminate related > stages > --- > > Key: CASSANDRA-10993 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10993 > Project: Cassandra > Issue Type: Sub-task > Components: Coordination, Local Write-Read Paths >Reporter: Aleksey Yeschenko >Assignee: Tyler Hobbs > Fix For: 3.x > > > Building on work done by [~tjake] (CASSANDRA-10528), [~slebresne] > (CASSANDRA-5239), and others, convert read and write request paths to be > fully non-blocking, to enable the eventual transition from SEDA to TPC > (CASSANDRA-10989) > Eliminate {{MUTATION}}, {{COUNTER_MUTATION}}, {{VIEW_MUTATION}}, {{READ}}, > and {{READ_REPAIR}} stages, move read and write execution directly to Netty > context. > For lack of decent async I/O options on Linux, we’ll still have to retain an > extra thread pool for serving read requests for data not residing in our page > cache (CASSANDRA-5863), however. > Implementation-wise, we only have two options available to us: explicit FSMs > and chained futures. Fibers would be the third, and easiest option, but > aren’t feasible in Java without resorting to direct bytecode manipulation > (ourselves or using [quasar|https://github.com/puniverse/quasar]). > I have seen 4 implementations based on chained futures/promises now - three > in Java and one in C++ - and I’m not convinced that it’s the optimal (or > sane) choice for representing our complex logic - think 2i quorum read > requests with timeouts at all levels, read repair (blocking and > non-blocking), and speculative retries in the mix, {{SERIAL}} reads and > writes. 
> I’m currently leaning towards an implementation based on explicit FSMs, and > intend to provide a prototype - soonish - for comparison with > {{CompletableFuture}}-like variants. > Either way the transition is a relatively boring straightforward refactoring. > There are, however, some extension points on both write and read paths that > we do not control: > - authorisation implementations will have to be non-blocking. We have control > over built-in ones, but for any custom implementation we will have to execute > them in a separate thread pool > - 2i hooks on the write path will need to be non-blocking > - any trigger implementations will not be allowed to block > - UDFs and UDAs > We are further limited by API compatibility restrictions in the 3.x line, > forbidding us to alter, or add any non-{{default}} interface methods to those > extension points, so these pose a problem. > Depending on logistics, expecting to get this done in time for 3.4 or 3.6 > feature release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15370179#comment-15370179 ] Norman Maurer commented on CASSANDRA-8457: -- [~jasobrown] I did a first review round and left some comments. All in all it looks solid from a Netty perspective; I can't say a lot about the Cassandra perspective here :) Let me know if you have any questions.
[jira] [Commented] (CASSANDRA-10993) Make read and write requests paths fully non-blocking, eliminate related stages
[ https://issues.apache.org/jira/browse/CASSANDRA-10993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368142#comment-15368142 ] Norman Maurer commented on CASSANDRA-10993: --- [~thobbs] can you explain why you cannot just add the tasks to Netty's EventLoop directly? Is it because you do not want to wake it up, but just run these tasks once the EventLoop runs?
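The question about handing tasks to the EventLoop directly can be illustrated with a stand-in: a Netty EventLoop behaves like a single-threaded executor, so `execute(...)` from any thread enqueues the task, wakes the loop if it is idle, and runs the task on the loop thread. The sketch below uses a JDK executor as that stand-in (plain java.util.concurrent, not Netty API), which is enough to show the single-writer-thread property being discussed:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class EventLoopSketch {
    public static void main(String[] args) {
        // A single-threaded executor standing in for a Netty EventLoop:
        // every submitted task runs on the one "event-loop" thread, giving
        // the same serialization guarantee an EventLoop provides.
        ExecutorService loop =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "event-loop"));
        CompletableFuture<String> ranOn = new CompletableFuture<>();

        // Submitted from the main thread, executed on the loop thread.
        loop.execute(() -> ranOn.complete(Thread.currentThread().getName()));

        System.out.println(ranOn.join()); // event-loop
        loop.shutdown();
    }
}
```

The trade-off Norman hints at: `execute(...)` may wake a sleeping loop, whereas queueing work externally and draining it on the loop's next natural iteration avoids the wakeup at the cost of added latency.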
[jira] [Commented] (CASSANDRA-10993) Make read and write requests paths fully non-blocking, eliminate related stages
[ https://issues.apache.org/jira/browse/CASSANDRA-10993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15366364#comment-15366364 ] Norman Maurer commented on CASSANDRA-10993: --- [~thobbs] basically we made it final to guard ourselves from users who would depend on some methods that we may want to remove later on. Can you explain a bit about what you are trying to do, or show some code, so I can better understand the use case?
[jira] [Commented] (CASSANDRA-10735) Support netty openssl (netty-tcnative) for client encryption
[ https://issues.apache.org/jira/browse/CASSANDRA-10735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15339194#comment-15339194 ] Norman Maurer commented on CASSANDRA-10735: --- Sorry for the delay, but the good news is that I think I have everything needed implemented locally now... Stay tuned while I get everything merged into Netty. Once it is in, I will look into adding it to Cassandra itself. Performance FTW ;) > Support netty openssl (netty-tcnative) for client encryption > > > Key: CASSANDRA-10735 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10735 > Project: Cassandra > Issue Type: Improvement >Reporter: Andy Tolbert >Assignee: Norman Maurer > Fix For: 3.x > > Attachments: netty-ssl-trunk.tgz, nettyssl-bench.tgz, > nettysslbench.png, nettysslbench_small.png, sslbench12-03.png > > > The java-driver recently added support for using netty openssl via > [netty-tcnative|http://netty.io/wiki/forked-tomcat-native.html] in > [JAVA-841|https://datastax-oss.atlassian.net/browse/JAVA-841], this shows a > very measured improvement (numbers incoming on that ticket). It seems > likely that this can offer improvement if implemented C* side as well. > Since netty-tcnative has platform specific requirements, this should not be > made the default, but rather be an option that one can use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11921) Upgrade to Netty 4.1 + PR5314
[ https://issues.apache.org/jira/browse/CASSANDRA-11921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15306538#comment-15306538 ] Norman Maurer commented on CASSANDRA-11921: --- [~snazy] no... the only thing to consider is that 4.1.x is using the PooledByteBufAllocator by default while 4.0.x is not. Also, as a side note: I will merge the PR into 4.0 as well. > Upgrade to Netty 4.1 + PR5314 > - > > Key: CASSANDRA-11921 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11921 > Project: Cassandra > Issue Type: Improvement >Reporter: Robert Stupp >Assignee: Robert Stupp >Priority: Minor > Fix For: 3.x > > > Netty [PR5314|https://github.com/netty/netty/pull/5314] works around > {{Bits.reserveMemory}}+{{Cleaner}} and introduces an independent off-heap > memory pool. > Requirement for CASSANDRA-11870 > Local tests of Netty4.1+PR5314 against trunk were running fine. > Any incompatibilities or else to consider when upgrading from Netty 4.0 to > 4.1? > /cc [~norman] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
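If the allocator default change is a concern during the upgrade, Netty exposes a system property to pin the allocator explicitly. This is a hedged config sketch: it assumes the standard `io.netty.allocator.type` property (read by `ByteBufUtil`), and `app.jar` is a placeholder for the real artifact:

```shell
# Keep Netty 4.0's unpooled default after moving to 4.1, so the upgrade
# does not silently change allocation behavior. "app.jar" is hypothetical.
java -Dio.netty.allocator.type=unpooled -jar app.jar
```

Removing the flag later, after validating the pooled allocator under load, restores 4.1's default.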
[jira] [Comment Edited] (CASSANDRA-11818) C* does neither recover nor trigger stability inspector on direct memory OOM
[ https://issues.apache.org/jira/browse/CASSANDRA-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304464#comment-15304464 ] Norman Maurer edited comment on CASSANDRA-11818 at 5/27/16 6:14 PM: [~snazy] Sorry for the late response (busy as always :( )... I wonder if this would be something that may be helpful for you in terms of Netty: https://github.com/netty/netty/pull/5314 was (Author: norman): Sorry for the late response (busy as always :( )... I wonder if this would be something that may be helpful for you in terms of Netty: https://github.com/netty/netty/pull/5314 > C* does neither recover nor trigger stability inspector on direct memory OOM > > > Key: CASSANDRA-11818 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11818 > Project: Cassandra > Issue Type: Bug >Reporter: Robert Stupp > Attachments: 11818-direct-mem-unpooled.png, 11818-direct-mem.png, > oom-histo-live.txt, oom-stack.txt > > > The following stack trace is not caught by {{JVMStabilityInspector}}. > Situation was caused by a load test with a lot of parallel writes and reads > against a single node. 
> {code} > ERROR [SharedPool-Worker-1] 2016-05-17 18:38:44,187 Message.java:611 - > Unexpected exception during request; channel = [id: 0x1e02351b, > L:/127.0.0.1:9042 - R:/127.0.0.1:51087] > java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:693) ~[na:1.8.0_92] > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > ~[na:1.8.0_92] > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) > ~[na:1.8.0_92] > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:672) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:234) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at io.netty.buffer.PoolArena.allocate(PoolArena.java:218) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at io.netty.buffer.PoolArena.allocate(PoolArena.java:138) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:270) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:105) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:349) > ~[main/:na] > at > org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:314) > ~[main/:na] > at > io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:619) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > 
io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:676) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:612) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > org.apache.cassandra.transport.Message$Dispatcher$Flusher.run(Message.java:445) > ~[main/:na] > at > io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:374) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_92] > {code} > The situation does not get better when the load driver is stopped. > I can reproduce this scenario at will. Managed to get histogram, stack traces > and heap dump. Already increased {{-XX:MaxDirectMemorySize}} to {{2g}}.
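The report says this {{OutOfMemoryError}} is not caught by {{JVMStabilityInspector}}. One plausible contributing factor (an assumption for illustration, not a claim about Cassandra's actual implementation) is that a fatal error can reach an inspector wrapped inside another exception. A hypothetical check in the spirit of the inspector would walk the cause chain:

```java
public class InspectorSketch {
    // Hypothetical fatal-error check: walk the cause chain so a wrapped
    // OutOfMemoryError (e.g. "Direct buffer memory" surfaced inside a
    // handler's exception) is still treated as fatal.
    static boolean isFatal(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof OutOfMemoryError) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable wrapped =
                new RuntimeException(new OutOfMemoryError("Direct buffer memory"));
        System.out.println(isFatal(wrapped));                  // true
        System.out.println(isFatal(new RuntimeException("x"))); // false
    }
}
```

With such a check, a direct-memory OOM surfacing anywhere in the Netty write path would still trigger the stability policy (e.g. dying instead of limping along), rather than being logged as an "unexpected exception during request".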
[jira] [Commented] (CASSANDRA-11818) C* does neither recover nor trigger stability inspector on direct memory OOM
[ https://issues.apache.org/jira/browse/CASSANDRA-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15304464#comment-15304464 ] Norman Maurer commented on CASSANDRA-11818: --- Sorry for the late response (busy as always :( )... I wonder if this would be something that may be helpful for you in terms of Netty: https://github.com/netty/netty/pull/5314 > C* does neither recover nor trigger stability inspector on direct memory OOM > > > Key: CASSANDRA-11818 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11818 > Project: Cassandra > Issue Type: Bug >Reporter: Robert Stupp > Attachments: 11818-direct-mem-unpooled.png, 11818-direct-mem.png, > oom-histo-live.txt, oom-stack.txt > > > The following stack trace is not caught by {{JVMStabilityInspector}}. > Situation was caused by a load test with a lot of parallel writes and reads > against a single node. > {code} > ERROR [SharedPool-Worker-1] 2016-05-17 18:38:44,187 Message.java:611 - > Unexpected exception during request; channel = [id: 0x1e02351b, > L:/127.0.0.1:9042 - R:/127.0.0.1:51087] > java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:693) ~[na:1.8.0_92] > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > ~[na:1.8.0_92] > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) > ~[na:1.8.0_92] > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:672) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:234) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at io.netty.buffer.PoolArena.allocate(PoolArena.java:218) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at io.netty.buffer.PoolArena.allocate(PoolArena.java:138) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:270) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:105) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:349) > ~[main/:na] > at > org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:314) > ~[main/:na] > at > io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:619) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:676) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:612) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > org.apache.cassandra.transport.Message$Dispatcher$Flusher.run(Message.java:445) > ~[main/:na] > at > io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:374) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at > 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) > ~[netty-all-4.0.36.Final.jar:4.0.36.Final] > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_92] > {code} > The situation does not get better when the load driver is stopped. > I can reproduce this scenario at will. Managed to get histogram, stack traces > and heap dump. Already increased {{-XX:MaxDirectMemorySize}} to {{2g}}. > A {{nodetool flush}} causes the daemon to exit (as that direct-memory OOM is > caught by {{JVMStabilityInspector}}). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
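The {{OutOfMemoryError: Direct buffer memory}} above is thrown once NIO direct allocations exceed {{-XX:MaxDirectMemorySize}}. As a stdlib-only sketch of how to watch that limit (the class name {{DirectPoolProbe}} is illustrative, not Cassandra or Netty code), the JVM's own {{BufferPoolMXBean}} reports the "direct" pool:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

public class DirectPoolProbe {

    // Find the MXBean that tracks NIO direct buffers, i.e. the pool that
    // java.nio.Bits.reserveMemory() checks against -XX:MaxDirectMemorySize.
    public static BufferPoolMXBean directPool() {
        for (BufferPoolMXBean bean : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(bean.getName())) {
                return bean;
            }
        }
        throw new IllegalStateException("no 'direct' buffer pool MXBean found");
    }

    public static void main(String[] args) {
        // Allocate one direct buffer so the pool has something to report.
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20); // 1 MiB
        BufferPoolMXBean pool = directPool();
        System.out.println("direct buffers: count=" + pool.getCount()
                + " used=" + pool.getMemoryUsed() + " bytes");
    }
}
```

Note that a pooled allocator such as Netty's may reserve direct memory in ways only partially reflected by this bean, so treat it as a rough gauge rather than an exact accounting.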
[jira] [Commented] (CASSANDRA-11749) CQLSH gets SSL exception following a COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15303499#comment-15303499 ] Norman Maurer commented on CASSANDRA-11749: --- [~Stefania] no problems... Sorry that it took me so long :( > CQLSH gets SSL exception following a COPY FROM > -- > > Key: CASSANDRA-11749 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11749 > Project: Cassandra > Issue Type: Bug > Components: Tools >Reporter: Stefania >Assignee: Stefania > Fix For: 2.1.x > > Attachments: driver_debug.txt, stdout.txt.zip, > stdout_single_process.txt.zip > > > When running Cassandra and cqlsh with SSL, the following command occasionally > results in the exception below: > {code} > cqlsh --ssl -f kv.cql > {code} > {code} > ERROR [SharedPool-Worker-2] 2016-05-11 12:41:03,583 Message.java:538 - > Unexpected exception during request; channel = [id: 0xeb75e05d, > /127.0.0.1:51083 => /127.0.0.1:9042] > io.netty.handler.codec.DecoderException: javax.net.ssl.SSLException: bad > record MAC > at > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:280) > ~[netty-all-4.0.23.Final.jar:4.0.23.Final] > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) > ~[netty-all-4.0.23.Final.jar:4.0.23.Final] > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > ~[netty-all-4.0.23.Final.jar:4.0.23.Final] > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) > ~[netty-all-4.0.23.Final.jar:4.0.23.Final] > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) > ~[netty-all-4.0.23.Final.jar:4.0.23.Final] > at > io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.epollInReady(EpollSocketChannel.java:722) > ~[netty-all-4.0.23.Final.jar:4.0.23.Final] > at > 
io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:326) > ~[netty-all-4.0.23.Final.jar:4.0.23.Final] > at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264) > ~[netty-all-4.0.23.Final.jar:4.0.23.Final] > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) > ~[netty-all-4.0.23.Final.jar:4.0.23.Final] > at > io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) > ~[netty-all-4.0.23.Final.jar:4.0.23.Final] > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91] > Caused by: javax.net.ssl.SSLException: bad record MAC > at sun.security.ssl.Alerts.getSSLException(Alerts.java:208) > ~[na:1.8.0_91] > at sun.security.ssl.SSLEngineImpl.fatal(SSLEngineImpl.java:1728) > ~[na:1.8.0_91] > at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:981) > ~[na:1.8.0_91] > at > sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:907) > ~[na:1.8.0_91] > at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781) > ~[na:1.8.0_91] > at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624) ~[na:1.8.0_91] > at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:982) > ~[netty-all-4.0.23.Final.jar:4.0.23.Final] > at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:908) > ~[netty-all-4.0.23.Final.jar:4.0.23.Final] > at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:854) > ~[netty-all-4.0.23.Final.jar:4.0.23.Final] > at > io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) > ~[netty-all-4.0.23.Final.jar:4.0.23.Final] > ... 10 common frames omitted > Caused by: javax.crypto.BadPaddingException: bad record MAC > at sun.security.ssl.InputRecord.decrypt(InputRecord.java:219) > ~[na:1.8.0_91] > at > sun.security.ssl.EngineInputRecord.decrypt(EngineInputRecord.java:177) > ~[na:1.8.0_91] > at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:974) > ~[na:1.8.0_91] > ... 
17 common frames omitted > {code} > where > {code} > cat kv.cql > create keyspace if not exists cvs_copy_ks with replication = {'class': > 'SimpleStrategy', 'replication_factor':1}; > create table if not exists cvs_copy_ks.kv (key int primary key, value text); > truncate cvs_copy_ks.kv; > copy cvs_copy_ks.kv (key, value) from 'kv.csv' with header='true'; > select * from cvs_copy_ks.kv; > drop keyspace cvs_copy_ks; > stefi@cuoricina:~/git/cstar/cassandra$ cat kv.c > kv.cql kv.csv > cat kv.csv > key,value > 1,'a' > 2,'b' > 3,'c' > {code} > The COPY FROM succeeds, however the following select does not. > The easiest way to
[jira] [Commented] (CASSANDRA-11749) CQLSH gets SSL exception following a COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302311#comment-15302311 ] Norman Maurer commented on CASSANDRA-11749: --- [~jjordan] Only very briefly, but my suspicion so far is that it is actually a race in cqlsh where multiple threads write to the same connection concurrently. Like one writes and a second starts to write as well before the first is complete. Could this be possible? This would also explain why a sleep may "work around" this. > CQLSH gets SSL exception following a COPY FROM > -- > > Key: CASSANDRA-11749 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11749 > Project: Cassandra > Issue Type: Bug > Components: Tools >Reporter: Stefania >Assignee: Stefania > Fix For: 2.1.x > > Attachments: stdout.txt.zip, stdout_single_process.txt.zip > > > When running Cassandra and cqlsh with SSL, {{cqlsh --ssl -f kv.cql}} occasionally fails with {{javax.net.ssl.SSLException: bad record MAC}} (full stack trace and CQL/CSV reproduction quoted in the first message for this issue above).
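If the suspicion above is right, the failure mode is that bytes from two concurrent writers interleave inside a TLS record, so the peer's MAC check fails with "bad record MAC". A minimal sketch of the usual fix, serializing frame writes on a per-connection lock ({{SerializedWriter}} is an illustrative name, not part of cqlsh or any driver):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

public class SerializedWriter {
    private final OutputStream out;
    private final Object lock = new Object();

    public SerializedWriter(OutputStream out) {
        this.out = out;
    }

    // Write one frame atomically with respect to other callers. Without the
    // lock, bytes from two concurrent frames could interleave; on a TLS
    // connection that corrupts the record the peer tries to authenticate.
    public void writeFrame(byte[] frame) {
        synchronized (lock) {
            try {
                out.write(frame);
                out.flush();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```

The same idea applies in any language: either funnel all writes for a connection through one thread, or guard the write path with a lock held for the whole frame.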
[jira] [Commented] (CASSANDRA-11749) CQLSH gets SSL exception following a COPY FROM
[ https://issues.apache.org/jira/browse/CASSANDRA-11749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15282697#comment-15282697 ] Norman Maurer commented on CASSANDRA-11749: --- [~Stefania] If you can give me a step-by-step way to reproduce this on my laptop, I will have a look and try to figure out what is wrong. > CQLSH gets SSL exception following a COPY FROM > -- > > Key: CASSANDRA-11749 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11749 > Project: Cassandra > Issue Type: Bug > Components: Tools >Reporter: Stefania >Assignee: Stefania > Fix For: 2.1.x > > Attachments: stdout.txt.zip, stdout_single_process.txt.zip > > > When running Cassandra and cqlsh with SSL, {{cqlsh --ssl -f kv.cql}} occasionally fails with {{javax.net.ssl.SSLException: bad record MAC}} (full stack trace and CQL/CSV reproduction quoted in the first message for this issue above).
[jira] [Commented] (CASSANDRA-10735) Support netty openssl (netty-tcnative) for client encryption
[ https://issues.apache.org/jira/browse/CASSANDRA-10735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220135#comment-15220135 ] Norman Maurer commented on CASSANDRA-10735: --- [~iamaleksey] you are fast buddy ;) > Support netty openssl (netty-tcnative) for client encryption > > > Key: CASSANDRA-10735 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10735 > Project: Cassandra > Issue Type: Improvement >Reporter: Andy Tolbert >Assignee: Norman Maurer > Fix For: 3.x > > Attachments: netty-ssl-trunk.tgz, nettyssl-bench.tgz, > nettysslbench.png, nettysslbench_small.png, sslbench12-03.png > > > The java-driver recently added support for using netty openssl via > [netty-tcnative|http://netty.io/wiki/forked-tomcat-native.html] in > [JAVA-841|https://datastax-oss.atlassian.net/browse/JAVA-841], this shows a > very measured improvement (numbers incoming on that ticket). It seems > likely that this can offer improvement if implemented C* side as well. > Since netty-tcnative has platform specific requirements, this should not be > made the default, but rather be an option that one can use. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-10735) Support netty openssl (netty-tcnative) for client encryption
[ https://issues.apache.org/jira/browse/CASSANDRA-10735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220125#comment-15220125 ] Norman Maurer commented on CASSANDRA-10735: --- Actually, you may want to assign this to me? ;) > Support netty openssl (netty-tcnative) for client encryption > > > Key: CASSANDRA-10735 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10735 > Project: Cassandra > Issue Type: Improvement >Reporter: Andy Tolbert >Assignee: Aleksey Yeschenko > Fix For: 3.x > > Attachments: netty-ssl-trunk.tgz, nettyssl-bench.tgz, nettysslbench.png, nettysslbench_small.png, sslbench12-03.png > > > The java-driver added netty openssl (netty-tcnative) support in JAVA-841 with measurable improvement; the same is likely true server-side (full description quoted in the message above). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141370#comment-15141370 ] Norman Maurer commented on CASSANDRA-8457: -- [~jasobrown] you know where to find me ;) > nio MessagingService > > > Key: CASSANDRA-8457 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8457 > Project: Cassandra > Issue Type: New Feature >Reporter: Jonathan Ellis >Assignee: Jason Brown >Priority: Minor > Labels: performance > Fix For: 3.x > > > Thread-per-peer (actually two each incoming and outbound) is a big > contributor to context switching, especially for larger clusters. Let's look > at switching to nio, possibly via Netty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11047) native protocol will not bind ipv6
[ https://issues.apache.org/jira/browse/CASSANDRA-11047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15124053#comment-15124053 ] Norman Maurer commented on CASSANDRA-11047: --- [~brandon.williams] netty 4.0.34.Final was released which has a fix for it. So I think it's up to you guys now to upgrade :) > native protocol will not bind ipv6 > -- > > Key: CASSANDRA-11047 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11047 > Project: Cassandra > Issue Type: Bug > Components: CQL >Reporter: Brandon Williams >Assignee: Norman Maurer > Fix For: 2.1.x, 2.2.x, 3.x > > > When you set rpc_address to 0.0.0.0 it should bind every interface. Of > course for ipv6 you have to comment out -Djava.net.preferIPv4Stack=true from > cassandra-env.sh, however this will not make the native protocol bind on > ipv6, only thrift: > {noformat} > tcp6 0 0 :::9160 :::*LISTEN > 13488/java > tcp6 0 0 0.0.0.0:9042:::*LISTEN > 13488/java > # telnet ::1 9160 > Trying ::1... > Connected to ::1. > Escape character is '^]'. > ^] > telnet> quit > Connection closed. > # telnet ::1 9042 > Trying ::1... > telnet: Unable to connect to remote host: Connection refused > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
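For reference on what the fixed behaviour should look like: with {{-Djava.net.preferIPv4Stack=true}} removed, binding to the IPv6 unspecified address {{::}} is the dual-stack analogue of {{0.0.0.0}}, and on Linux it typically accepts v4-mapped connections as well, which is why the Thrift port shows up as {{tcp6 :::9160}}. A small stdlib-only sketch (not Cassandra code) showing that {{::}} parses to a wildcard address:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class WildcardCheck {

    // "::" is an address literal, so this resolves without any DNS lookup.
    // Binding a server socket to it is the IPv6/dual-stack equivalent of
    // binding to 0.0.0.0.
    public static InetAddress v6Wildcard() {
        try {
            return InetAddress.getByName("::");
        } catch (UnknownHostException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        InetAddress any = v6Wildcard();
        System.out.println(any + " anyLocal=" + any.isAnyLocalAddress());
    }
}
```

Whether a single {{::}} bind also serves IPv4 clients depends on the OS (for example the {{net.ipv6.bindv6only}} sysctl on Linux), so dual-stack behaviour should be verified per platform.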
[jira] [Commented] (CASSANDRA-11047) native protocol will not bind ipv6
[ https://issues.apache.org/jira/browse/CASSANDRA-11047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15118835#comment-15118835 ] Norman Maurer commented on CASSANDRA-11047: --- [~brandon.williams] I just proposed a fix for netty: https://github.com/netty/netty/pull/4770 > native protocol will not bind ipv6 > -- > > Key: CASSANDRA-11047 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11047 > Project: Cassandra > Issue Type: Bug > Components: CQL >Reporter: Brandon Williams >Assignee: Norman Maurer > Fix For: 2.1.x, 2.2.x, 3.x > > > With rpc_address 0.0.0.0 and -Djava.net.preferIPv4Stack=true removed, the native protocol still does not bind IPv6; only Thrift does (full description and netstat/telnet output quoted in the first message for this issue above). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11047) native protocol will not bind ipv6
[ https://issues.apache.org/jira/browse/CASSANDRA-11047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15112652#comment-15112652 ] Norman Maurer commented on CASSANDRA-11047: --- Will check next week. > native protocol will not bind ipv6 > -- > > Key: CASSANDRA-11047 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11047 > Project: Cassandra > Issue Type: Bug > Components: CQL >Reporter: Brandon Williams >Assignee: Norman Maurer > Fix For: 2.1.x, 2.2.x, 3.x > > > With rpc_address 0.0.0.0 and -Djava.net.preferIPv4Stack=true removed, the native protocol still does not bind IPv6; only Thrift does (full description and netstat/telnet output quoted in the first message for this issue above). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-11047) native protocol will not bind ipv6
[ https://issues.apache.org/jira/browse/CASSANDRA-11047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111030#comment-15111030 ] Norman Maurer commented on CASSANDRA-11047: --- [~brandon.williams] Are you using the native (epoll) transport or just NIO? Which Netty version? > native protocol will not bind ipv6 > -- > > Key: CASSANDRA-11047 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11047 > Project: Cassandra > Issue Type: Bug > Components: CQL >Reporter: Brandon Williams > Fix For: 2.1.x, 2.2.x, 3.x > > > With rpc_address 0.0.0.0 and -Djava.net.preferIPv4Stack=true removed, the native protocol still does not bind IPv6; only Thrift does (full description and netstat/telnet output quoted in the first message for this issue above). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (CASSANDRA-11047) native protocol will not bind ipv6
[ https://issues.apache.org/jira/browse/CASSANDRA-11047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Norman Maurer reassigned CASSANDRA-11047: - Assignee: Norman Maurer > native protocol will not bind ipv6 > -- > > Key: CASSANDRA-11047 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11047 > Project: Cassandra > Issue Type: Bug > Components: CQL >Reporter: Brandon Williams >Assignee: Norman Maurer > Fix For: 2.1.x, 2.2.x, 3.x > > > With rpc_address 0.0.0.0 and -Djava.net.preferIPv4Stack=true removed, the native protocol still does not bind IPv6; only Thrift does (full description and netstat/telnet output quoted in the first message for this issue above). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-10559) Support encrypted and plain traffic on the same port
[ https://issues.apache.org/jira/browse/CASSANDRA-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Norman Maurer updated CASSANDRA-10559: -- Attachment: 0001-CASSANDRA-8803-Allow-to-serve-plain-and-encrypted-2.1.patch > Support encrypted and plain traffic on the same port > > > Key: CASSANDRA-10559 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10559 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Norman Maurer >Assignee: Norman Maurer > Fix For: 3.0.0 > > Attachments: > 0001-CASSANDRA-8803-Allow-to-serve-plain-and-encrypted-2.1.patch, > 0001-CASSANDRA-8803-Allow-to-serve-plain-and-encrypted-na.patch > > > To be able to migrate clusters in a rolling way from plain to encrypted > traffic it would be very helpful if we could have Cassandra accept both on > the same port. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-10559) Support encrypted and plain traffic on the same port
Norman Maurer created CASSANDRA-10559: - Summary: Support encrypted and plain traffic on the same port Key: CASSANDRA-10559 URL: https://issues.apache.org/jira/browse/CASSANDRA-10559 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Norman Maurer Assignee: Norman Maurer To be able to migrate clusters in a rolling way from plain to encrypted traffic it would be very helpful if we could have Cassandra accept both on the same port. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
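The mechanism that makes same-port operation possible is cheap to sketch: a TLS connection always opens with a handshake record, so a server can peek at the first bytes of a new connection and route it to an SSL or plaintext pipeline. A hedged, stdlib-only illustration of that check ({{ProtocolSniffer}} is a made-up name, not taken from the attached patch; Netty itself later shipped a comparable detector):

```java
public final class ProtocolSniffer {

    private ProtocolSniffer() {
    }

    // A TLS connection begins with a handshake record: content type 0x16
    // followed by the 0x03 protocol major version byte. Anything else can be
    // handed straight to the plaintext pipeline.
    public static boolean looksLikeTls(byte[] head) {
        return head.length >= 2
                && head[0] == (byte) 0x16
                && head[1] == (byte) 0x03;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeTls(new byte[]{0x16, 0x03, 0x01})); // true
        System.out.println(looksLikeTls(new byte[]{0x05, 0x00}));       // false
    }
}
```

A production detector should inspect a few more bytes (legacy SSLv2-style hellos use different framing), but the two-byte check captures the routing idea behind this ticket.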
[jira] [Commented] (CASSANDRA-8803) Implement transitional mode in C* that will accept both encrypted and non-encrypted client traffic
[ https://issues.apache.org/jira/browse/CASSANDRA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14966487#comment-14966487 ] Norman Maurer commented on CASSANDRA-8803: -- [~brandon.williams] sorry for the delay... Here we go https://issues.apache.org/jira/browse/CASSANDRA-10559 > Implement transitional mode in C* that will accept both encrypted and > non-encrypted client traffic > -- > > Key: CASSANDRA-8803 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8803 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Vishy Kasar > > We have some non-secure clusters taking live traffic in production from > active clients. We want to enable client to node encryption on these > clusters. Once we set the client_encryption_options enabled to true in yaml > and bounce a cassandra node in the ring, the existing clients that do not do > SSL will fail to connect to that node. > There does not seem to be a good way to roll this change with out taking an > outage. Can we implement a transitional mode in C* that will accept both > encrypted and non-encrypted client traffic? We would enable this during > transition and turn it off after both server and client start talking SSL. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-10559) Support encrypted and plain traffic on the same port
[ https://issues.apache.org/jira/browse/CASSANDRA-10559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Norman Maurer updated CASSANDRA-10559: -- Attachment: 0001-CASSANDRA-8803-Allow-to-serve-plain-and-encrypted-na.patch > Support encrypted and plain traffic on the same port > > > Key: CASSANDRA-10559 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10559 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Norman Maurer >Assignee: Norman Maurer > Attachments: > 0001-CASSANDRA-8803-Allow-to-serve-plain-and-encrypted-na.patch > > > To be able to migrate clusters in a rolling way from plain to encrypted > traffic it would be very helpful if we could have Cassandra accept both on > the same port. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8803) Implement transitional mode in C* that will accept both encrypted and non-encrypted client traffic
[ https://issues.apache.org/jira/browse/CASSANDRA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907779#comment-14907779 ] Norman Maurer commented on CASSANDRA-8803: -- [~brandon.williams] against which branch should I create the patch? > Implement transitional mode in C* that will accept both encrypted and > non-encrypted client traffic > -- > > Key: CASSANDRA-8803 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8803
[jira] [Commented] (CASSANDRA-8803) Implement transitional mode in C* that will accept both encrypted and non-encrypted client traffic
[ https://issues.apache.org/jira/browse/CASSANDRA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906156#comment-14906156 ] Norman Maurer commented on CASSANDRA-8803: -- [~brandon.williams] I have a patch here that I would like to submit which allows serving SSL and non-SSL on the same port, without the need for STARTTLS etc. This will make things a lot easier. Should I just reopen this issue and attach the patch here, or what? > Implement transitional mode in C* that will accept both encrypted and > non-encrypted client traffic > -- > > Key: CASSANDRA-8803 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8803
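The "same port, no STARTTLS" approach mentioned above is not shown in this thread, but the usual technique can be sketched: a TLS connection always begins with a handshake record (content type 0x16), while plain protocol traffic does not, so the server can peek at the first bytes and install an SSL handler only when they look like TLS. The class and method names below are illustrative, not from the attached patch; Netty later shipped a similar check as OptionalSslHandler.

```java
// Sketch (not the attached patch): decide per connection whether the first
// bytes received look like a TLS record, so a server can add an SSL handler
// to the pipeline for TLS clients and serve plaintext clients as-is.
public final class TlsDetector {
    // A TLS record starts with a content-type byte; 0x16 is "handshake",
    // which is what a ClientHello begins with.
    private static final int TLS_HANDSHAKE = 0x16;

    private TlsDetector() {}

    /** Returns true if the first bytes of a connection look like a TLS ClientHello. */
    public static boolean looksLikeTls(byte[] firstBytes) {
        if (firstBytes == null || firstBytes.length < 3) {
            return false; // not enough data yet; a real server would keep buffering
        }
        // Byte 0: record type; byte 1: protocol major version (3 for SSLv3/TLS).
        return (firstBytes[0] & 0xFF) == TLS_HANDSHAKE && (firstBytes[1] & 0xFF) == 0x03;
    }

    public static void main(String[] args) {
        byte[] clientHello = {0x16, 0x03, 0x01, 0x00, 0x2f}; // start of a TLS handshake
        byte[] plainFrame  = {0x04, 0x00, 0x00, 0x00, 0x05}; // e.g. a plaintext protocol frame
        System.out.println(looksLikeTls(clientHello)); // true
        System.out.println(looksLikeTls(plainFrame));  // false
    }
}
```

In a Netty pipeline this check would run in a first inbound handler that then replaces itself with either an SslHandler or nothing, which is what makes a rolling plain-to-encrypted migration possible.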
[jira] [Commented] (CASSANDRA-8803) Implement transitional mode in C* that will accept both encrypted and non-encrypted client traffic
[ https://issues.apache.org/jira/browse/CASSANDRA-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634665#comment-14634665 ] Norman Maurer commented on CASSANDRA-8803: -- [~brandon.williams] you can assign this to me if you like... I will work on a patch. Implement transitional mode in C* that will accept both encrypted and non-encrypted client traffic -- Key: CASSANDRA-8803 URL: https://issues.apache.org/jira/browse/CASSANDRA-8803 Project: Cassandra Issue Type: Improvement Components: Core Reporter: Vishy Kasar Fix For: 2.0.x
[jira] [Commented] (CASSANDRA-9558) Cassandra-stress regression in 2.2
[ https://issues.apache.org/jira/browse/CASSANDRA-9558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593780#comment-14593780 ] Norman Maurer commented on CASSANDRA-9558: -- Sorry for being late to the party, but this somehow got lost in my inbox :( From a Netty standpoint you are right: flushing from outside the EventLoop is pretty expensive, as it will need to wake up the selector if it is not already awake and processing work. So the best thing you can do is either always write/flush from within the EventLoop, or try to minimize the flushes from outside the EventLoop. That said, if you point me to the place in your code where you do the flush and the other work, I'm happy to have a look and see if I can give you some ideas on how to improve it. Just let me know! Cassandra-stress regression in 2.2 -- Key: CASSANDRA-9558 URL: https://issues.apache.org/jira/browse/CASSANDRA-9558 Project: Cassandra Issue Type: Bug Reporter: Alan Boudreault Fix For: 2.2.0 rc2 Attachments: 2.1.log, 2.2.log, CASSANDRA-9558-2.patch, CASSANDRA-9558-ProtocolV2.patch, atolber-CASSANDRA-9558-stress.tgz, atolber-trunk-driver-coalescing-disabled.txt, stress-2.1-java-driver-2.0.9.2.log, stress-2.1-java-driver-2.2+PATCH.log, stress-2.1-java-driver-2.2.log, stress-2.2-java-driver-2.2+PATCH.log, stress-2.2-java-driver-2.2.log We are seeing some regression in performance when using cassandra-stress 2.2. You can see the difference at this url: http://riptano.github.io/cassandra_performance/graph_v5/graph.html?stats=stress_regression.json&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=108.57&ymin=0&ymax=168147.1 The cassandra version of the cluster doesn't seem to have any impact. //cc [~tjake] [~benedict]
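The advice above (write and flush from inside the event loop, coalescing flushes) can be illustrated with plain java.util.concurrent rather than the Netty API. This is a sketch of the principle only: a single-threaded executor stands in for the EventLoop, and "flushing" is simulated, so the names and counters here are illustrative.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration of flush coalescing: many writes submitted to the event loop
// followed by a single flush result in one "socket flush", instead of one
// wakeup-and-flush per write when flushing eagerly from outside the loop.
public final class CoalescingWriter {
    private final ExecutorService eventLoop = Executors.newSingleThreadExecutor();
    private final StringBuilder pending = new StringBuilder(); // only touched on the loop thread
    private final AtomicInteger flushes = new AtomicInteger();

    /** Enqueue a write; state is mutated only on the event-loop thread. */
    public void write(String msg) {
        eventLoop.execute(() -> pending.append(msg));
    }

    /** Flush whatever is pending, once, from the event-loop thread. */
    public void flush() {
        eventLoop.execute(() -> {
            if (pending.length() > 0) {
                pending.setLength(0);     // pretend we wrote the batch to the socket
                flushes.incrementAndGet();
            }
        });
    }

    public int flushCount() {
        return flushes.get();
    }

    public void shutdown() {
        eventLoop.shutdown();
        try {
            eventLoop.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        CoalescingWriter w = new CoalescingWriter();
        for (int i = 0; i < 10; i++) w.write("msg" + i); // ten writes...
        w.flush();                                        // ...one flush
        w.shutdown();
        System.out.println(w.flushCount()); // 1
    }
}
```

In real Netty the cost being avoided is the `Selector` wakeup: a flush from outside the loop must interrupt the selector's blocking select, whereas work already running on the loop thread can batch writes and flush at the end of the event-processing pass.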
[jira] [Updated] (CASSANDRA-8086) Cassandra should have ability to limit the number of native connections
[ https://issues.apache.org/jira/browse/CASSANDRA-8086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Norman Maurer updated CASSANDRA-8086: - Attachment: 0001-CASSANDRA-8086-Allow-to-limit-the-number-of-native-2.1.patch Patch against 2.1. Cassandra should have ability to limit the number of native connections --- Key: CASSANDRA-8086 URL: https://issues.apache.org/jira/browse/CASSANDRA-8086 Project: Cassandra Issue Type: Bug Reporter: Vishy Kasar Assignee: Norman Maurer Fix For: 2.1.4 Attachments: 0001-CASSANDRA-8086-Allow-to-limit-the-number-of-native-2.1.patch, 0001-CASSANDRA-8086-Allow-to-limit-the-number-of-native-c-2.0.patch, 0001-CASSANDRA-8086-Allow-to-limit-the-number-of-native-c-final-v2.patch, 0001-CASSANDRA-8086-Allow-to-limit-the-number-of-native-c-final.patch, 0001-CASSANDRA-8086-Allow-to-limit-the-number-of-native-c.patch, 0001-CASSANDRA-8086-Allow-to-limit-the-number-of-native-c.txt We have a production cluster with 72 instances spread across 2 DCs, and a large number (~40,000) of clients hitting this cluster. A client normally connects to 4 Cassandra instances. Some event (we think it is a schema change on the server side) triggered the clients to establish connections to all Cassandra instances of the local DC. This brought the servers to their knees. The client connections failed and clients attempted re-connections. Cassandra should protect itself from such an attack by clients. Do we have any knobs to control the maximum number of connections? If not, we need to add that knob.
[jira] [Commented] (CASSANDRA-8086) Cassandra should have ability to limit the number of native connections
[ https://issues.apache.org/jira/browse/CASSANDRA-8086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347404#comment-14347404 ] Norman Maurer commented on CASSANDRA-8086: -- Will get you a patch asap. Cassandra should have ability to limit the number of native connections --- Key: CASSANDRA-8086 URL: https://issues.apache.org/jira/browse/CASSANDRA-8086
[jira] [Issue Comment Deleted] (CASSANDRA-8086) Cassandra should have ability to limit the number of native connections
[ https://issues.apache.org/jira/browse/CASSANDRA-8086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Norman Maurer updated CASSANDRA-8086: - Comment: was deleted (was: you are right, sigh... fixing now) Cassandra should have ability to limit the number of native connections --- Key: CASSANDRA-8086 URL: https://issues.apache.org/jira/browse/CASSANDRA-8086
[jira] [Commented] (CASSANDRA-8086) Cassandra should have ability to limit the number of native connections
[ https://issues.apache.org/jira/browse/CASSANDRA-8086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345293#comment-14345293 ] Norman Maurer commented on CASSANDRA-8086: -- You are right, sigh... fixing now. Cassandra should have ability to limit the number of native connections --- Key: CASSANDRA-8086 URL: https://issues.apache.org/jira/browse/CASSANDRA-8086
[jira] [Commented] (CASSANDRA-8086) Cassandra should have ability to limit the number of native connections
[ https://issues.apache.org/jira/browse/CASSANDRA-8086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345314#comment-14345314 ] Norman Maurer commented on CASSANDRA-8086: -- Addressed the comments and uploaded a new patch. Cassandra should have ability to limit the number of native connections --- Key: CASSANDRA-8086 URL: https://issues.apache.org/jira/browse/CASSANDRA-8086
[jira] [Updated] (CASSANDRA-8086) Cassandra should have ability to limit the number of native connections
[ https://issues.apache.org/jira/browse/CASSANDRA-8086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Norman Maurer updated CASSANDRA-8086: - Attachment: 0001-CASSANDRA-8086-Allow-to-limit-the-number-of-native-c-final-v2.patch Addressed the comments. Cassandra should have ability to limit the number of native connections --- Key: CASSANDRA-8086 URL: https://issues.apache.org/jira/browse/CASSANDRA-8086
[jira] [Updated] (CASSANDRA-8086) Cassandra should have ability to limit the number of native connections
[ https://issues.apache.org/jira/browse/CASSANDRA-8086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Norman Maurer updated CASSANDRA-8086: - Attachment: 0001-CASSANDRA-8086-Allow-to-limit-the-number-of-native-c-final.patch This is the patch with the comments addressed. It basically does not add the handler to the pipeline unless it is explicitly enabled, so it is effectively dead code and should pose no risk. Cassandra should have ability to limit the number of native connections --- Key: CASSANDRA-8086 URL: https://issues.apache.org/jira/browse/CASSANDRA-8086
[jira] [Commented] (CASSANDRA-8086) Cassandra should have ability to limit the number of native connections
[ https://issues.apache.org/jira/browse/CASSANDRA-8086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325853#comment-14325853 ] Norman Maurer commented on CASSANDRA-8086: -- Fair enough... let me adjust the patch to only add the handler to the pipeline if enabled. Cassandra should have ability to limit the number of native connections --- Key: CASSANDRA-8086 URL: https://issues.apache.org/jira/browse/CASSANDRA-8086
[jira] [Updated] (CASSANDRA-8086) Cassandra should have ability to limit the number of native connections
[ https://issues.apache.org/jira/browse/CASSANDRA-8086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Norman Maurer updated CASSANDRA-8086: - Attachment: 0001-CASSANDRA-8086-Allow-to-limit-the-number-of-native-c.patch Latest patch, which allows limiting per source IP or limiting in general. Cassandra should have ability to limit the number of native connections --- Key: CASSANDRA-8086 URL: https://issues.apache.org/jira/browse/CASSANDRA-8086
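The two limits the patch mentions (per source IP and a global cap) can be sketched with plain JDK concurrency primitives. This is an illustration of the bookkeeping only, not the attached patch; all names are made up, and in a real server `tryAcquire`/`release` would be called from a channel's open/close handlers.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of native-connection limiting: a global counter plus a per-IP
// counter map. A connection is admitted only if both caps have headroom;
// counters are rolled back on rejection so state stays consistent.
public final class ConnectionLimiter {
    private final int maxGlobal;
    private final int maxPerIp;
    private final AtomicInteger global = new AtomicInteger();
    private final ConcurrentHashMap<String, AtomicInteger> perIp = new ConcurrentHashMap<>();

    public ConnectionLimiter(int maxGlobal, int maxPerIp) {
        this.maxGlobal = maxGlobal;
        this.maxPerIp = maxPerIp;
    }

    /** Try to register a new connection from {@code ip}; false means "reject". */
    public boolean tryAcquire(String ip) {
        if (global.incrementAndGet() > maxGlobal) {
            global.decrementAndGet(); // over the global cap: undo and reject
            return false;
        }
        AtomicInteger count = perIp.computeIfAbsent(ip, k -> new AtomicInteger());
        if (count.incrementAndGet() > maxPerIp) {
            count.decrementAndGet();  // over the per-IP cap: undo both and reject
            global.decrementAndGet();
            return false;
        }
        return true;
    }

    /** Release the slots when the connection closes. */
    public void release(String ip) {
        AtomicInteger count = perIp.get(ip);
        if (count != null) count.decrementAndGet();
        global.decrementAndGet();
    }

    public static void main(String[] args) {
        ConnectionLimiter limiter = new ConnectionLimiter(3, 2);
        System.out.println(limiter.tryAcquire("10.0.0.1")); // true
        System.out.println(limiter.tryAcquire("10.0.0.1")); // true
        System.out.println(limiter.tryAcquire("10.0.0.1")); // false: per-IP cap of 2
    }
}
```

This also shows why wiring the check in as a pipeline handler only when enabled (as discussed above) is cheap: with no handler installed there is no per-connection accounting at all.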
[jira] [Updated] (CASSANDRA-8086) Cassandra should have ability to limit the number of native connections
[ https://issues.apache.org/jira/browse/CASSANDRA-8086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Norman Maurer updated CASSANDRA-8086: - Attachment: 0001-CASSANDRA-8086-Allow-to-limit-the-number-of-native-c-2.0.patch Patch which applies to the cassandra-2.0 branch. Cassandra should have ability to limit the number of native connections --- Key: CASSANDRA-8086 URL: https://issues.apache.org/jira/browse/CASSANDRA-8086
[jira] [Updated] (CASSANDRA-6198) Distinguish streaming traffic at network level
[ https://issues.apache.org/jira/browse/CASSANDRA-6198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Norman Maurer updated CASSANDRA-6198: - Attachment: 0001-CASSANDRA-6198-Set-IPTOS_THROUGHPUT-on-streaming-con-v2.txt This includes changes to address Brandon's comments. Distinguish streaming traffic at network level -- Key: CASSANDRA-6198 URL: https://issues.apache.org/jira/browse/CASSANDRA-6198 Project: Cassandra Issue Type: Improvement Components: Core Reporter: sankalp kohli Assignee: Norman Maurer Priority: Minor Fix For: 2.1.3 Attachments: 0001-CASSANDRA-6198-Set-IPTOS_THROUGHPUT-on-streaming-con-v2.txt, 0001-CASSANDRA-6198-Set-IPTOS_THROUGHPUT-on-streaming-con.txt It would be nice to have some information in the TCP packet which network teams can inspect to distinguish streaming traffic from other organic Cassandra traffic. This is very useful for monitoring WAN traffic. Here are some solutions: 1) Use a different port for streaming. 2) Set an IP header.
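The patch name above points at the mechanism: setting the IPTOS_THROUGHPUT bit of the IP TOS field on streaming sockets so network gear can classify the traffic. A minimal stdlib sketch (the class name is illustrative, and the OS is free to ignore the hint):

```java
import java.net.Socket;
import java.net.SocketException;

// Sketch: tag a streaming socket with the RFC 1349 IPTOS_THROUGHPUT bit
// (0x08) so WAN monitoring can tell streaming from regular client traffic.
// setTrafficClass is best-effort; the underlying stack may ignore it.
public final class StreamingSocketTagger {
    public static final int IPTOS_THROUGHPUT = 0x08;

    private StreamingSocketTagger() {}

    /** Mark a socket as bulk-throughput traffic. */
    public static void tagForStreaming(Socket socket) throws SocketException {
        socket.setTrafficClass(IPTOS_THROUGHPUT);
    }

    public static void main(String[] args) throws Exception {
        try (Socket s = new Socket()) {
            tagForStreaming(s); // set before connect so the hint applies from the first packet
            System.out.println("TOS bit requested: 0x" + Integer.toHexString(IPTOS_THROUGHPUT));
        }
    }
}
```

Of the two options listed in the ticket, this is option 2 (an IP header field): it needs no firewall or port changes and is visible to any device that can inspect the IP header.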
[jira] [Commented] (CASSANDRA-6198) Distinguish streaming traffic at network level
[ https://issues.apache.org/jira/browse/CASSANDRA-6198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237726#comment-14237726 ] Norman Maurer commented on CASSANDRA-6198: -- [~brandon.williams] sorry for the delay; I just addressed your comment and uploaded a new version of the patch with the change included. Distinguish streaming traffic at network level -- Key: CASSANDRA-6198 URL: https://issues.apache.org/jira/browse/CASSANDRA-6198
[jira] [Commented] (CASSANDRA-6198) Distinguish streaming traffic at network level
[ https://issues.apache.org/jira/browse/CASSANDRA-6198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226587#comment-14226587 ] Norman Maurer commented on CASSANDRA-6198: -- Agreed, Boolean.getBoolean(...) would be better. Should I adjust the patch? Distinguish streaming traffic at network level -- Key: CASSANDRA-6198 URL: https://issues.apache.org/jira/browse/CASSANDRA-6198
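For readers unfamiliar with the API being agreed on above: `Boolean.getBoolean(name)` reads a *system property* and returns true only if that property exists and equals "true" (case-insensitive), which makes it a one-liner for JVM-flag-style toggles like `-Dcassandra.foo=true`. The property names below are illustrative.

```java
// Demonstrates Boolean.getBoolean: it reads a system property by name,
// unlike Boolean.parseBoolean, which parses its argument string directly.
public final class BooleanGetBooleanDemo {
    public static void main(String[] args) {
        System.setProperty("cassandra.example.flag", "true"); // hypothetical flag name
        System.out.println(Boolean.getBoolean("cassandra.example.flag"));  // true
        System.out.println(Boolean.getBoolean("cassandra.example.unset")); // false: property absent
        System.out.println(Boolean.parseBoolean("TRUE"));                  // true: parses the literal
    }
}
```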
[jira] [Updated] (CASSANDRA-6198) Distinguish streaming traffic at network level
[ https://issues.apache.org/jira/browse/CASSANDRA-6198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Norman Maurer updated CASSANDRA-6198: - Attachment: 0001-CASSANDRA-6198-Set-IPTOS_THROUGHPUT-on-streaming-con.txt Distinguish streaming traffic at network level -- Key: CASSANDRA-6198 URL: https://issues.apache.org/jira/browse/CASSANDRA-6198
[jira] [Updated] (CASSANDRA-6198) Distinguish streaming traffic at network level
[ https://issues.apache.org/jira/browse/CASSANDRA-6198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Norman Maurer updated CASSANDRA-6198: - Attachment: (was: 0001-CASSANDRA-6198-Set-IPTOS_THROUGHPUT-on-streaming-con.txt) Distinguish streaming traffic at network level -- Key: CASSANDRA-6198 URL: https://issues.apache.org/jira/browse/CASSANDRA-6198
[jira] [Updated] (CASSANDRA-6198) Distinguish streaming traffic at network level
[ https://issues.apache.org/jira/browse/CASSANDRA-6198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Norman Maurer updated CASSANDRA-6198: - Attachment: 0001-CASSANDRA-6198-Set-IPTOS_THROUGHPUT-on-streaming-con.txt Distinguish streaming traffic at network level -- Key: CASSANDRA-6198 URL: https://issues.apache.org/jira/browse/CASSANDRA-6198
[jira] [Commented] (CASSANDRA-6198) Distinguish streaming traffic at network level
[ https://issues.apache.org/jira/browse/CASSANDRA-6198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14220968#comment-14220968 ] Norman Maurer commented on CASSANDRA-6198: -- Please review... Distinguish streaming traffic at network level -- Key: CASSANDRA-6198 URL: https://issues.apache.org/jira/browse/CASSANDRA-6198
[jira] [Updated] (CASSANDRA-8086) Cassandra should have ability to limit the number of native connections
[ https://issues.apache.org/jira/browse/CASSANDRA-8086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Norman Maurer updated CASSANDRA-8086: - Attachment: 0001-CASSANDRA-8086-Allow-to-limit-the-number-of-native-c.txt Please review... This is against 2.1. Cassandra should have ability to limit the number of native connections --- Key: CASSANDRA-8086 URL: https://issues.apache.org/jira/browse/CASSANDRA-8086
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14105211#comment-14105211 ] Norman Maurer commented on CASSANDRA-7743: -- [~benedict] so no netty issue at all ?

> Possible C* OOM issue during long running test
> ----------------------------------------------
>
> Key: CASSANDRA-7743
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7743
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: Google Compute Engine, n1-standard-1
> Reporter: Pierre Laporte
> Assignee: Benedict
> Fix For: 2.1 rc6
>
> During a long running test, we ended up with a lot of java.lang.OutOfMemoryError: Direct buffer memory errors on the Cassandra instances. Here is an example of a stack trace from system.log:
> {code}
> ERROR [SharedPool-Worker-1] 2014-08-11 11:09:34,610 ErrorMessage.java:218 - Unexpected exception during request
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_25]
>   at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.7.0_25]
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) ~[na:1.7.0_25]
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:98) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>   at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:251) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>   at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>   at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>   at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>   at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>   at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:112) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>   at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>   at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>   at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>   at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>   at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
> {code}
> The test consisted of a 3-node cluster of n1-standard-1 GCE instances (1 vCPU, 3.75 GB RAM) running cassandra-2.1.0-rc5, and a n1-standard-2 instance running the test. After ~2.5 days, several requests start to fail and we see the previous stack traces in the system.log file. The output from linux 'free' and 'meminfo' suggests that there is still memory available.
> {code}
> $ free -m
>              total       used       free     shared    buffers     cached
> Mem:          3702       3532        169          0        161        854
> -/+ buffers/cache:       2516       1185
> Swap:            0          0          0
> $ head -n 4 /proc/meminfo
> MemTotal:       3791292 kB
> MemFree:         173568 kB
> Buffers:         165608 kB
> Cached:          874752 kB
> {code}
> These errors do not affect all the queries we run. The cluster is still responsive but is unable to display tracing information using cqlsh:
> {code}
> $ ./bin/nodetool --host 10.240.137.253 status duration_test
> Datacenter: DC1
> ===============
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
> UN  10.240.98.27    925.17 KB  256     100.0%            41314169-eff5-465f-85ea-d501fd8f9c5e  RAC1
> UN  10.240.137.253  1.1 MB     256     100.0%            c706f5f9-c5f3-4d5e-95e9-a8903823827e  RAC1
> UN  10.240.72.183   896.57 KB  256     100.0%            15735c4d-98d4-4ea4-a305-7ab2d92f65fc  RAC1
> $ echo
> {code}
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096625#comment-14096625 ] Norman Maurer commented on CASSANDRA-7743: -- [~benedict] hmm... it should always get returned to the pool that it was allocated from. Could you provide me with an easy way to reproduce?

> Possible C* OOM issue during long running test
> ----------------------------------------------
>
> Key: CASSANDRA-7743
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7743
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096634#comment-14096634 ] Norman Maurer commented on CASSANDRA-7743: -- [~benedict] Yeah, it adds to the cache of the releasing thread, that is right... I thought you were talking about returning it to the pool.

> Possible C* OOM issue during long running test
> ----------------------------------------------
>
> Key: CASSANDRA-7743
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7743
[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test
[ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096647#comment-14096647 ] Norman Maurer commented on CASSANDRA-7743: -- [~benedict] well, it will be released after a while if not used. But I think for your use-case it would be best to disable the cache, which can be done via the PooledByteBufAllocator constructor: just pass in 0 for the int tinyCacheSize, int smallCacheSize and int normalCacheSize arguments.

> Possible C* OOM issue during long running test
> ----------------------------------------------
>
> Key: CASSANDRA-7743
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7743
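For reference, the cache-disabling PooledByteBufAllocator construction described above would look roughly like this against Netty 4.0.x. The non-cache argument values below are illustrative, not recommendations, and the exact constructor signature is version-dependent, so check the Javadoc of the Netty release in use:

```java
import io.netty.buffer.PooledByteBufAllocator;

// Passing 0 for tinyCacheSize, smallCacheSize and normalCacheSize disables
// the per-thread buffer caches. The other arguments (preferDirect,
// nHeapArena, nDirectArena, pageSize, maxOrder) use illustrative values.
PooledByteBufAllocator allocator = new PooledByteBufAllocator(
        true,   // preferDirect
        2,      // nHeapArena
        2,      // nDirectArena
        8192,   // pageSize
        11,     // maxOrder
        0,      // tinyCacheSize   -> thread-local tiny cache disabled
        0,      // smallCacheSize  -> thread-local small cache disabled
        0);     // normalCacheSize -> thread-local normal cache disabled
```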
[jira] [Commented] (CASSANDRA-7695) Inserting the same row in parallel causes bad data to be returned to the client
[ https://issues.apache.org/jira/browse/CASSANDRA-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095316#comment-14095316 ] Norman Maurer commented on CASSANDRA-7695: -- Hey guys, I finally found the root cause of the problem and fixed it in Netty. That said, I think it is also possible to see the same problem when using non-unsafe ByteBufs (if you are lucky enough). The problem was easier to reproduce on OSX because triggering it requires a particular series of incomplete/complete writes, which the stock OSX network configuration produces more readily. The issue and fix can be found here: https://github.com/netty/netty/issues/2761

> Inserting the same row in parallel causes bad data to be returned to the client
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-7695
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7695
> Project: Cassandra
> Issue Type: Bug
> Environment: Linux 3.12.21, JVM 1.7u60, Cassandra server 2.1.0 RC 5, Cassandra datastax client version 2.1.0RC1
> Reporter: Johan Bjork
> Assignee: T Jake Luciani
> Priority: Blocker
> Labels: qa-resolved
> Fix For: 2.1.0
> Attachments: 7695-workaround.txt, PutFailureRepro.java, bad-data-tid43-get, bad-data-tid43-put
>
> Running the attached test program against a cassandra 2.1 server results in scrambled data returned by the SELECT statement. Running it against latest stable works fine.
> Attached:
> * Program that reproduces the failure
> * Example output files from the mentioned test program with the scrambled output
> Failure mode: the value returned by 'get' is scrambled; the size is correct but some bytes have shifted locations in the returned buffer.
> Cluster info: for the test we set up a single cassandra node using the stock configuration file.
[jira] [Updated] (CASSANDRA-7695) Inserting the same row in parallel causes bad data to be returned to the client
[ https://issues.apache.org/jira/browse/CASSANDRA-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Norman Maurer updated CASSANDRA-7695: - Attachment: 0001-CASSANDRA-7695-Workaround-Netty-bug-by-not-use-Compo.patch This diff works around the bug while still allowing the use of unsafe, etc.

> Inserting the same row in parallel causes bad data to be returned to the client
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-7695
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7695
[jira] [Commented] (CASSANDRA-7695) Inserting the same row in parallel causes bad data to be returned to the client
[ https://issues.apache.org/jira/browse/CASSANDRA-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095526#comment-14095526 ] Norman Maurer commented on CASSANDRA-7695: -- [~tjake] yay :)

> Inserting the same row in parallel causes bad data to be returned to the client
> -------------------------------------------------------------------------------
>
> Key: CASSANDRA-7695
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7695
[jira] [Commented] (CASSANDRA-6861) Optimise our Netty 4 integration
[ https://issues.apache.org/jira/browse/CASSANDRA-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13992883#comment-13992883 ] Norman Maurer commented on CASSANDRA-6861: -- We will most likely ship our OpenSslEngine in the next release of Netty. That said, it only supports server-side usage at the moment.

> Optimise our Netty 4 integration
> --------------------------------
>
> Key: CASSANDRA-6861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6861
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Benedict
> Assignee: T Jake Luciani
> Priority: Minor
> Labels: performance
> Fix For: 2.1 rc1
>
> Now we've upgraded to Netty 4, we're generating a lot of garbage that could be avoided, so we should probably stop that. Should be reasonably easy to hook into Netty's pooled buffers, returning them to the pool once a given message is completed.
[jira] [Commented] (CASSANDRA-6861) Optimise our Netty 4 integration
[ https://issues.apache.org/jira/browse/CASSANDRA-6861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993380#comment-13993380 ] Norman Maurer commented on CASSANDRA-6861: -- Opened a PR with some improvements: https://github.com/apache/cassandra/pull/35

> Optimise our Netty 4 integration
> --------------------------------
>
> Key: CASSANDRA-6861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6861
[jira] [Commented] (CASSANDRA-6236) Update native protocol server to Netty 4
[ https://issues.apache.org/jira/browse/CASSANDRA-6236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955528#comment-13955528 ] Norman Maurer commented on CASSANDRA-6236: -- There is now even the recorded video for this: https://www.youtube.com/watch?v=_GRIyCMNGGI Anyway, what do you guys think about having me do the heavy lifting and submit a patch?

> Update native protocol server to Netty 4
> ----------------------------------------
>
> Key: CASSANDRA-6236
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6236
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Sylvain Lebresne
> Assignee: Benedict
> Priority: Minor
> Fix For: 2.1 beta2
>
> We should switch to Netty 4 at some point, since it's the future.
[jira] [Commented] (CASSANDRA-6236) Update native protocol server to Netty 4
[ https://issues.apache.org/jira/browse/CASSANDRA-6236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955542#comment-13955542 ] Norman Maurer commented on CASSANDRA-6236: -- lol... Ok, tell me which branch etc. to start from and I will get back to you guys in the next few days ;)

> Update native protocol server to Netty 4
> ----------------------------------------
>
> Key: CASSANDRA-6236
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6236
[jira] [Commented] (CASSANDRA-6235) Improve native protocol server latency
[ https://issues.apache.org/jira/browse/CASSANDRA-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13812866#comment-13812866 ] Norman Maurer commented on CASSANDRA-6235: -- You are still using 3.x, right?

> Improve native protocol server latency
> --------------------------------------
>
> Key: CASSANDRA-6235
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6235
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Sylvain Lebresne
> Attachments: NPTester.java
>
> The tl;dr is that the native protocol server seems to add some non-negligible latency to operations compared to the thrift server. And the added latency seems to lie within Netty's internals as far as I can tell. I'm not sure what to tweak to try to reduce that.
> The test I ran is simple: it's {{stress -t 1 -L3}}, the Cassandra stress test for insertions with just 1 thread and using CQL-over-thrift (to make things more comparable). What I'm interested in is the average latency. Also, because I don't care about testing the storage engine or even CQL processing, I've disabled the processing of statements: all queries just return an empty result set right away (there's no parsing of the query in particular). The resulting branch is at https://github.com/pcmanus/cassandra/commits/latency-testing (note that there's a trivial patch to have stress show the latency in microseconds).
> With that branch (single node), I get with thrift ~62μs of average latency. That number is actually fairly stable across runs (not doing any real processing helps having consistent performance here).
> For the native protocol, I wanted to eliminate the possibility that the DataStax Java driver was the bottleneck, so I wrote a very simple class (NPTester.java, attached) that emulates the stress test above but with the native protocol. It's not excessively pretty but it's simple (no dependencies, compiles with javac NPTester.java) and it tries to minimize the client-side overhead. It's just a basic loop that writes query frames (serializing them largely manually) and reads the result back. And it measures the latency as close to the socket as possible. Unless I've done something really wrong, it should have less client-side overhead than what stress has.
> With that tester, the average latency I get is ~140μs. This is more than twice that of thrift. To try to understand where that additional latency was spent, I instrumented the Frame coder/decoder to record latencies (last commit of the latency-testing branch above): it records how long it takes to decode, execute and re-encode the query. The latency for that is ~35μs (like the other numbers above, this is pretty consistent over runs). Given that my ping on localhost is 30μs, this suggests that compared to thrift, Netty spends ~70μs more than the thrift server somewhere while reading and/or writing data on the wire. I've tried YourKit-ing it but I didn't see anything obvious, so I'm not sure what the problem is, but it sure would be nice to get on par (or at least much closer) with thrift on such a simple test.
> I'll note that if I run the same tests without disabling actual query processing, the tests have a bit more variability, but for thrift I get ~220-230μs latency on average while the NPTester gets ~290-300μs. In other words, there still seems to be that 70μs overhead for the native protocol. Which in that case is still a 30% slowdown.
> I'll also note that test comparisons with more threads (using the Java driver this time) also show the native protocol being slightly slower than thrift (~5-10% slower), and while there might be inefficiencies in the Java driver, I'm growing more and more convinced that at least part of it is due to the latency issue described above.
[jira] [Commented] (CASSANDRA-2478) Custom CQL protocol/transport
[ https://issues.apache.org/jira/browse/CASSANDRA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407219#comment-13407219 ] Norman Maurer commented on CASSANDRA-2478: -- @Yuki I think everything is ok... Let me do a final review cycle again today.

> Custom CQL protocol/transport
> -----------------------------
>
> Key: CASSANDRA-2478
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2478
> Project: Cassandra
> Issue Type: New Feature
> Components: API, Core
> Reporter: Eric Evans
> Assignee: Sylvain Lebresne
> Priority: Minor
> Labels: cql
> Attachments: cql_binary_protocol, cql_binary_protocol-v2
>
> A custom wire protocol would give us the flexibility to optimize for our specific use-cases, and eliminate a troublesome dependency (I'm referring to Thrift, but none of the others would be significantly better). Additionally, RPC is a bad fit here, and we'd do better to move in the direction of something that natively supports streaming.
> I don't think this is as daunting as it might seem initially. Utilizing an existing server framework like Netty, combined with some copy-and-paste of bits from other FLOSS projects, would probably get us 80% of the way there.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (CASSANDRA-2478) Custom CQL protocol/transport
[ https://issues.apache.org/jira/browse/CASSANDRA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407260#comment-13407260 ] Norman Maurer commented on CASSANDRA-2478: -- We released 3.5.2.Final today.. not sure if you used it ;)

> Custom CQL protocol/transport
> -----------------------------
>
> Key: CASSANDRA-2478
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2478
> Fix For: 1.2