[jira] [Updated] (CASSANDRA-19554) Website - Download section - Update / remove EOL dates
[ https://issues.apache.org/jira/browse/CASSANDRA-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-19554: --- Description: Enterprise customers with on-prem Cassandra usage can be pretty nitpicking in terms of EOL, running unsupported Cassandra versions and they often refer to what is stated in https://cassandra.apache.org/_/download.html (as the only source available?) and don't really think about the dependency to 5.0 GA, but just reflecting EOL date information there. As of April 11, 2024, the download section states the following information: !image-2024-04-11-13-15-52-317.png! According to that, 3.x is unmaintained, 4.0 soon to be EOL etc ... Either remove these EOL estimates or keep them strongly maintained aligned with an updated 5.0 GA timeline. Thanks! was: Enterprise customers with on-prem Cassandra usage can be pretty nitpicking in terms of EOL, running unsupported Cassandra versions and they often refer to what is stated in https://cassandra.apache.org/_/download.html (as the only source available) and don't really think about the dependency to 5.0 GA, but just reflecting EOL date information there. As of April 11, 2024, the download section states the following information: !image-2024-04-11-13-15-52-317.png! According to that, 3.x is unmaintained, 4.0 soon to be EOL etc ... Either remove these EOL estimates or keep them strongly maintained aligned with an updated 5.0 GA timeline. Thanks! > Website - Download section - Update / remove EOL dates > -- > > Key: CASSANDRA-19554 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19554 > Project: Cassandra > Issue Type: Task > Components: Documentation/Website >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: image-2024-04-11-13-15-52-317.png > > > Enterprise customers with on-prem Cassandra usage can be pretty nitpicking in > terms of EOL, running unsupported Cassandra versions and they often refer to > what is stated in https://cassandra.apache.org/_/download.html (as the only > source available?) and don't really think about the dependency to 5.0 GA, but > just reflecting EOL date information there. > As of April 11, 2024, the download section states the following information: > !image-2024-04-11-13-15-52-317.png! > According to that, 3.x is unmaintained, 4.0 soon to be EOL etc ... > Either remove these EOL estimates or keep them strongly maintained aligned > with an updated 5.0 GA timeline. > Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19554) Website - Download section - Update / remove EOL dates
Thomas Steinmaurer created CASSANDRA-19554: -- Summary: Website - Download section - Update / remove EOL dates Key: CASSANDRA-19554 URL: https://issues.apache.org/jira/browse/CASSANDRA-19554 Project: Cassandra Issue Type: Task Components: Documentation/Website Reporter: Thomas Steinmaurer Attachments: image-2024-04-11-13-15-52-317.png Enterprise customers with on-prem Cassandra usage can be pretty nitpicking in terms of EOL, running unsupported Cassandra versions and they often refer to what is stated in https://cassandra.apache.org/_/download.html (as the only source available) and don't really think about the dependency to 5.0 GA, but just reflecting EOL date information there. As of April 11, 2024, the download section states the following information: !image-2024-04-11-13-15-52-317.png! According to that, 3.x is unmaintained, 4.0 soon to be EOL etc ... Either remove these EOL estimates or keep them strongly maintained aligned with an updated 5.0 GA timeline. Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19554) Website - Download section - Update / remove EOL dates
[ https://issues.apache.org/jira/browse/CASSANDRA-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-19554: --- Description: Enterprise customers with on-prem Cassandra usage can be pretty nitpicking in terms of EOL, running unsupported Cassandra versions and they often refer to what is stated in https://cassandra.apache.org/_/download.html (as the only source available?) and don't really think about the dependency to 5.0 GA, but just reflecting EOL date information there. As of April 11, 2024, the download section states the following information: !image-2024-04-11-13-15-52-317.png! According to that, 3.x is unmaintained, 4.0 soon to be EOL etc ... Either remove these EOL estimates or keep them strongly maintained aligned with an updated 5.0 GA timeline. Thanks! was: Enterprise customers with on-prem Cassandra usage can be pretty nitpicking in terms of EOL, running unsupported Cassandra versions and they often refer to what is stated in https://cassandra.apache.org/_/download.html (as the only source available?) and don't really think about the dependency to 5.0 GA, but just reflecting EOL date information there. As of April 11, 2024, the download section states the following information: !image-2024-04-11-13-15-52-317.png! According to that, 3.x is unmaintained, 4.0 soon to be EOL etc ... Either remove these EOL estimates or keep them stronly maintained aligned with an updated 5.0 GA timeline. Thanks! > Website - Download section - Update / remove EOL dates > -- > > Key: CASSANDRA-19554 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19554 > Project: Cassandra > Issue Type: Task > Components: Documentation/Website >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: image-2024-04-11-13-15-52-317.png > > > Enterprise customers with on-prem Cassandra usage can be pretty nitpicking in > terms of EOL, running unsupported Cassandra versions and they often refer to > what is stated in https://cassandra.apache.org/_/download.html (as the only > source available?) and don't really think about the dependency to 5.0 GA, but > just reflecting EOL date information there. > As of April 11, 2024, the download section states the following information: > !image-2024-04-11-13-15-52-317.png! > According to that, 3.x is unmaintained, 4.0 soon to be EOL etc ... > Either remove these EOL estimates or keep them strongly maintained aligned > with an updated 5.0 GA timeline. > Thanks! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-18891) Cassandra 4.0 - JNA 5.6.0 does not support macOS M1 arm64
[ https://issues.apache.org/jira/browse/CASSANDRA-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer reassigned CASSANDRA-18891: -- Assignee: Thomas Steinmaurer (was: Maxim Muzafarov) > Cassandra 4.0 - JNA 5.6.0 does not support macOS M1 arm64 > - > > Key: CASSANDRA-18891 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18891 > Project: Cassandra > Issue Type: Bug > Components: Dependencies >Reporter: Thomas Steinmaurer >Assignee: Thomas Steinmaurer >Priority: Normal > Fix For: 4.0.x > > Attachments: signature.asc > > Time Spent: 10m > Remaining Estimate: 0h > > As discussed on Slack: > [https://the-asf.slack.com/archives/CJZLTM05A/p1684745250901489] > Created this ticket as clone of CASSANDRA-17019, to ask for considering a JNA > library upgrade in Cassandra 4.0, so that we could utilize ARM-based AWS > Graviton instances e.g. m7g already with Cassandra 4.0. > From linked ticket: > "Cassandra depends on net.java.dev.jna.jna version 5.6.0 to do the native > binding into the C library. JNA 5.6.0 does not support arm64 architecture > (Apple M1 devices), causing cassandra to fail on bootstrap." -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18891) Cassandra 4.0 - JNA 5.6.0 does not support macOS M1 arm64
[ https://issues.apache.org/jira/browse/CASSANDRA-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17782532#comment-17782532 ] Thomas Steinmaurer commented on CASSANDRA-18891: [~mmuzaf], not yet. I'm really sorry, but the whole ticket was a bit misleading then. Sorry for the confusion. I was under the impression that 4.0 and arm64 compatibility was a general issue fixed in 4.1 and not specific to arm64 on Apple Silicon. > Cassandra 4.0 - JNA 5.6.0 does not support macOS M1 arm64 > - > > Key: CASSANDRA-18891 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18891 > Project: Cassandra > Issue Type: Bug > Components: Dependencies >Reporter: Thomas Steinmaurer >Assignee: Maxim Muzafarov >Priority: Normal > Fix For: 4.0.x > > Attachments: signature.asc > > Time Spent: 10m > Remaining Estimate: 0h > > As discussed on Slack: > [https://the-asf.slack.com/archives/CJZLTM05A/p1684745250901489] > Created this ticket as clone of CASSANDRA-17019, to ask for considering a JNA > library upgrade in Cassandra 4.0, so that we could utilize ARM-based AWS > Graviton instances e.g. m7g already with Cassandra 4.0. > From linked ticket: > "Cassandra depends on net.java.dev.jna.jna version 5.6.0 to do the native > binding into the C library. JNA 5.6.0 does not support arm64 architecture > (Apple M1 devices), causing cassandra to fail on bootstrap." -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18891) Cassandra 4.0 - JNA 5.6.0 does not support arm64
[ https://issues.apache.org/jira/browse/CASSANDRA-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770297#comment-17770297 ] Thomas Steinmaurer commented on CASSANDRA-18891: Sounds great! > Cassandra 4.0 - JNA 5.6.0 does not support arm64 > > > Key: CASSANDRA-18891 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18891 > Project: Cassandra > Issue Type: Bug > Components: Dependencies >Reporter: Thomas Steinmaurer >Priority: Normal > Fix For: 4.0.x > > > As discussed on Slack: > [https://the-asf.slack.com/archives/CJZLTM05A/p1684745250901489] > Created this ticket as clone of CASSANDRA-17019, to ask for considering a JNA > library upgrade in Cassandra 4.0, so that we could utilize ARM-based AWS > Graviton instances e.g. m7g already with Cassandra 4.0. > From linked ticket: > "Cassandra depends on net.java.dev.jna.jna version 5.6.0 to do the native > binding into the C library. JNA 5.6.0 does not support arm64 architecture > (Apple M1 devices), causing cassandra to fail on bootstrap." -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-18891) Cassandra 4.0 - JNA 5.6.0 does not support arm64
[ https://issues.apache.org/jira/browse/CASSANDRA-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-18891: --- Description: As discussed on Slack: [https://the-asf.slack.com/archives/CJZLTM05A/p1684745250901489] Created this ticket as clone of CASSANDRA-17019, to ask for considering a JNA library upgrade in Cassandra 4.0, so that we could utilize ARM-based AWS Graviton instances e.g. m7g already with Cassandra 4.0. From linked ticket: "Cassandra depends on net.java.dev.jna.jna version 5.6.0 to do the native binding into the C library. JNA 5.6.0 does not support arm64 architecture (Apple M1 devices), causing cassandra to fail on bootstrap." was: As discussed on Slack: [https://the-asf.slack.com/archives/CJZLTM05A/p1684745250901489] Created this ticket as clone of CASSANDRA-17019, to ask for considering a JNA library upgrade in Cassandra 4.0, so that we could utilize ARM-based AWS Graviton instances e.g. m7g already with Cassandra 4.0. Cassandra depends on net.java.dev.jna.jna version 5.6.0 to do the native binding into the C library. JNA 5.6.0 does not support arm64 architecture (Apple M1 devices), causing cassandra to fail on bootstrap. > Cassandra 4.0 - JNA 5.6.0 does not support arm64 > > > Key: CASSANDRA-18891 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18891 > Project: Cassandra > Issue Type: Bug > Components: Dependencies >Reporter: Thomas Steinmaurer >Priority: Normal > Fix For: 4.0.x > > > As discussed on Slack: > [https://the-asf.slack.com/archives/CJZLTM05A/p1684745250901489] > Created this ticket as clone of CASSANDRA-17019, to ask for considering a JNA > library upgrade in Cassandra 4.0, so that we could utilize ARM-based AWS > Graviton instances e.g. m7g already with Cassandra 4.0. > From linked ticket: > "Cassandra depends on net.java.dev.jna.jna version 5.6.0 to do the native > binding into the C library. JNA 5.6.0 does not support arm64 architecture > (Apple M1 devices), causing cassandra to fail on bootstrap." -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-18891) Cassandra 4.0 - JNA 5.6.0 does not support arm64
[ https://issues.apache.org/jira/browse/CASSANDRA-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-18891: --- Description: As discussed on Slack: [https://the-asf.slack.com/archives/CJZLTM05A/p1684745250901489] Created this ticket as clone of CASSANDRA-17019, to ask for considering a JNA library upgrade in Cassandra 4.0, so that we could utilize ARM-based AWS Graviton instances e.g. m7g already with Cassandra 4.0. Cassandra depends on net.java.dev.jna.jna version 5.6.0 to do the native binding into the C library. JNA 5.6.0 does not support arm64 architecture (Apple M1 devices), causing cassandra to fail on bootstrap. was: Cassandra depends on net.java.dev.jna.jna version 5.6.0 to do the native binding into the C library. JNA 5.6.0 does not support arm64 architecture (Apple M1 devices), causing cassandra to fail on bootstrap. Bumping the dependency to 5.9.0 adds arm64 support. Will a PR to bump the dependency be acceptable ? > Cassandra 4.0 - JNA 5.6.0 does not support arm64 > > > Key: CASSANDRA-18891 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18891 > Project: Cassandra > Issue Type: Bug > Components: Dependencies >Reporter: Thomas Steinmaurer >Priority: Normal > > As discussed on Slack: > [https://the-asf.slack.com/archives/CJZLTM05A/p1684745250901489] > Created this ticket as clone of CASSANDRA-17019, to ask for considering a JNA > library upgrade in Cassandra 4.0, so that we could utilize ARM-based AWS > Graviton instances e.g. m7g already with Cassandra 4.0. > Cassandra depends on net.java.dev.jna.jna version 5.6.0 to do the native > binding into the C library. > JNA 5.6.0 does not support arm64 architecture (Apple M1 devices), causing > cassandra to fail on bootstrap. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-18891) Cassandra 4.0 - JNA 5.6.0 does not support arm64
[ https://issues.apache.org/jira/browse/CASSANDRA-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-18891: --- Fix Version/s: 4.0.x > Cassandra 4.0 - JNA 5.6.0 does not support arm64 > > > Key: CASSANDRA-18891 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18891 > Project: Cassandra > Issue Type: Bug > Components: Dependencies >Reporter: Thomas Steinmaurer >Priority: Normal > Fix For: 4.0.x > > > As discussed on Slack: > [https://the-asf.slack.com/archives/CJZLTM05A/p1684745250901489] > Created this ticket as clone of CASSANDRA-17019, to ask for considering a JNA > library upgrade in Cassandra 4.0, so that we could utilize ARM-based AWS > Graviton instances e.g. m7g already with Cassandra 4.0. > Cassandra depends on net.java.dev.jna.jna version 5.6.0 to do the native > binding into the C library. > JNA 5.6.0 does not support arm64 architecture (Apple M1 devices), causing > cassandra to fail on bootstrap. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-18891) Cassandra 4.0 - JNA 5.6.0 does not support arm64
[ https://issues.apache.org/jira/browse/CASSANDRA-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-18891: --- Fix Version/s: (was: 4.1-alpha1) > Cassandra 4.0 - JNA 5.6.0 does not support arm64 > > > Key: CASSANDRA-18891 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18891 > Project: Cassandra > Issue Type: Bug > Components: Dependencies >Reporter: Thomas Steinmaurer >Priority: Normal > > Cassandra depends on net.java.dev.jna.jna version 5.6.0 to do the native > binding into the C library. > JNA 5.6.0 does not support arm64 architecture (Apple M1 devices), causing > cassandra to fail on bootstrap. > Bumping the dependency to 5.9.0 adds arm64 support. Will a PR to bump the > dependency be acceptable ? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-18891) Cassandra 4.0 - JNA 5.6.0 does not support arm64
[ https://issues.apache.org/jira/browse/CASSANDRA-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer reassigned CASSANDRA-18891: -- Assignee: (was: Yuqi Gu) > Cassandra 4.0 - JNA 5.6.0 does not support arm64 > > > Key: CASSANDRA-18891 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18891 > Project: Cassandra > Issue Type: Bug > Components: Dependencies >Reporter: Thomas Steinmaurer >Priority: Normal > Fix For: 4.1-alpha1 > > > Cassandra depends on net.java.dev.jna.jna version 5.6.0 to do the native > binding into the C library. > JNA 5.6.0 does not support arm64 architecture (Apple M1 devices), causing > cassandra to fail on bootstrap. > Bumping the dependency to 5.9.0 adds arm64 support. Will a PR to bump the > dependency be acceptable ? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-18891) Cassandra 4.0 - JNA 5.6.0 does not support arm64
Thomas Steinmaurer created CASSANDRA-18891: -- Summary: Cassandra 4.0 - JNA 5.6.0 does not support arm64 Key: CASSANDRA-18891 URL: https://issues.apache.org/jira/browse/CASSANDRA-18891 Project: Cassandra Issue Type: Bug Components: Dependencies Reporter: Thomas Steinmaurer Assignee: Yuqi Gu Fix For: 4.1-alpha1 Cassandra depends on net.java.dev.jna.jna version 5.6.0 to do the native binding into the C library. JNA 5.6.0 does not support arm64 architecture (Apple M1 devices), causing cassandra to fail on bootstrap. Bumping the dependency to 5.9.0 adds arm64 support. Will a PR to bump the dependency be acceptable ? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
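Editorial note on the report above: the failure mode is JNA being unable to load its bundled native dispatch library for the host architecture. The following standalone sketch is not part of Cassandra or of this ticket (the class name is invented); it simply forces JNA to initialize so that an unsupported platform surfaces as an UnsatisfiedLinkError, which is essentially what the bootstrap failure looks like.

{code:java}
import com.sun.jna.Native;
import com.sun.jna.Platform;

// Run with the jna jar from Cassandra's lib/ directory on the classpath.
// Referencing Native forces JNA to load its bundled native dispatch
// library, so on a host whose architecture the bundled natives do not
// cover this fails with an UnsatisfiedLinkError instead of printing.
public class JnaArchCheck
{
    public static void main(String[] args)
    {
        System.out.println("os.arch:         " + Platform.ARCH);
        // e.g. "darwin-aarch64" on Apple M1, "linux-aarch64" on AWS Graviton
        System.out.println("resource prefix: " + Platform.RESOURCE_PREFIX);
        // Accessing POINTER_SIZE triggers the native library load.
        System.out.println("pointer size:    " + Native.POINTER_SIZE);
    }
}
{code}

Per the ticket and the later comments in this thread, the piece missing from JNA 5.6.0 is the darwin-aarch64 (Apple M1) support, and bumping to 5.9.0 or newer is reported to add it.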
[jira] [Updated] (CASSANDRA-18891) Cassandra 4.0 - JNA 5.6.0 does not support arm64
[ https://issues.apache.org/jira/browse/CASSANDRA-18891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-18891: --- Source Control Link: (was: https://github.com/apache/cassandra/commit/2043cb9fb6b25ff34afb90467b9476a09acc3933) > Cassandra 4.0 - JNA 5.6.0 does not support arm64 > > > Key: CASSANDRA-18891 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18891 > Project: Cassandra > Issue Type: Bug > Components: Dependencies >Reporter: Thomas Steinmaurer >Priority: Normal > > Cassandra depends on net.java.dev.jna.jna version 5.6.0 to do the native > binding into the C library. > JNA 5.6.0 does not support arm64 architecture (Apple M1 devices), causing > cassandra to fail on bootstrap. > Bumping the dependency to 5.9.0 adds arm64 support. Will a PR to bump the > dependency be acceptable ? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16555) Add support for AWS Ec2 IMDSv2
[ https://issues.apache.org/jira/browse/CASSANDRA-16555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738372#comment-17738372 ] Thomas Steinmaurer commented on CASSANDRA-16555: Thanks a lot for driving that forward! Quick question though, without having checked the 3.11 PR in detail: in case IMDSv2 fails for whatever reason (and v2 seems to be the new default), e.g. also while refreshing the token, will there be a silent fallback to v1 (old behavior) plus e.g. a log entry in cassandra.log, to keep Cassandra operational? Just a thought, as the default has changed going from pre-3.11.16 to 3.11.16. Thanks a lot. > Add support for AWS Ec2 IMDSv2 > --- > > Key: CASSANDRA-16555 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16555 > Project: Cassandra > Issue Type: New Feature > Components: Consistency/Coordination >Reporter: Paul Rütter (BlueConic) >Assignee: Stefan Miklosovic >Priority: Normal > Fix For: 3.0.30, 3.11.16, 4.0.11, 4.1.3, 5.0 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > In order to patch a vulnerability, Amazon came up with a new version of their > metadata service. > It's no longer unrestricted but now requires a token (in a header), in order > to access the metadata service. > See > [https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html] > for more information. > Cassandra currently doesn't offer an out-of-the-box snitch class to support > this. > See > [https://cassandra.apache.org/doc/latest/operating/snitch.html#snitch-classes] > This issue asks to add support for this as a separate snitch class. > We'll probably do a PR for this, as we are in the process of developing one. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
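For context on the question above, the IMDSv2 handshake and one possible fallback path look roughly like the sketch below (Java 11+). The endpoint and header names are the documented EC2 instance metadata API; the fallback-to-v1 policy shown here is an assumption used for illustration, since that behaviour is exactly what the comment asks about and is not necessarily what the Cassandra patch implements.

{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Illustration only: IMDSv2 token handshake with an assumed fallback to IMDSv1.
public class Imds
{
    private static final String HOST = "http://169.254.169.254";
    private static final HttpClient CLIENT =
            HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(2)).build();

    public static String queryMetadata(String path) throws Exception
    {
        String token = fetchTokenOrNull();
        HttpRequest.Builder req = HttpRequest.newBuilder(URI.create(HOST + path)).GET();
        if (token != null)
            req.header("X-aws-ec2-metadata-token", token);   // IMDSv2 request
        // else: plain IMDSv1 GET; a real implementation should log this fallback
        HttpResponse<String> rsp = CLIENT.send(req.build(), HttpResponse.BodyHandlers.ofString());
        if (rsp.statusCode() != 200)
            throw new IllegalStateException("metadata query failed: " + rsp.statusCode());
        return rsp.body();
    }

    private static String fetchTokenOrNull()
    {
        try
        {
            // IMDSv2: PUT a session token request with a TTL header.
            HttpRequest req = HttpRequest.newBuilder(URI.create(HOST + "/latest/api/token"))
                    .PUT(HttpRequest.BodyPublishers.noBody())
                    .header("X-aws-ec2-metadata-token-ttl-seconds", "21600")
                    .build();
            HttpResponse<String> rsp = CLIENT.send(req, HttpResponse.BodyHandlers.ofString());
            return rsp.statusCode() == 200 ? rsp.body() : null;
        }
        catch (Exception e)
        {
            return null; // token endpoint unreachable -> assume IMDSv1 only
        }
    }

    public static void main(String[] args) throws Exception
    {
        System.out.println(queryMetadata("/latest/meta-data/placement/availability-zone"));
    }
}
{code}

Whether such a fallback should be silent, logged, or not happen at all is the operational question raised in the comment above.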
[jira] [Commented] (CASSANDRA-16555) Add out-of-the-box snitch for Ec2 IMDSv2
[ https://issues.apache.org/jira/browse/CASSANDRA-16555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718590#comment-17718590 ] Thomas Steinmaurer commented on CASSANDRA-16555: As there was a "VOTE on 3.11.15" sent out today in dev mailing list, I guess this addition won't make it into 3.11.15. > Add out-of-the-box snitch for Ec2 IMDSv2 > > > Key: CASSANDRA-16555 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16555 > Project: Cassandra > Issue Type: New Feature > Components: Consistency/Coordination >Reporter: Paul Rütter (BlueConic) >Assignee: fulco taen >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0.x, 4.1.x, 5.x > > Time Spent: 1h 40m > Remaining Estimate: 0h > > In order to patch a vulnerability, Amazon came up with a new version of their > metadata service. > It's no longer unrestricted but now requires a token (in a header), in order > to access the metadata service. > See > [https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html] > for more information. > Cassandra currently doesn't offer an out-of-the-box snitch class to support > this. > See > [https://cassandra.apache.org/doc/latest/operating/snitch.html#snitch-classes] > This issue asks to add support for this as a separate snitch class. > We'll probably do a PR for this, as we are in the process of developing one. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-18169) Warning at startup in 3.11.11 or above version of Cassandra
[ https://issues.apache.org/jira/browse/CASSANDRA-18169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677731#comment-17677731 ] Thomas Steinmaurer edited comment on CASSANDRA-18169 at 1/17/23 12:13 PM: -- We have seen that in the past as well, where this WARN log produced a bit of confusion after upgrading to 3.11.11+. https://issues.apache.org/jira/browse/CASSANDRA-16619?focusedCommentId=17441530&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17441530 Can be wrong, but perhaps related to the "md" => "me" SSTable format upgrade with 3.11.11+, when 3.11.11+ is reading "md" files upon startup. was (Author: tsteinmaurer): We have seen that in the past as well, where this WARN log produced a bit of confusion after upgrading to 3.11.11+. https://issues.apache.org/jira/browse/CASSANDRA-16619?focusedCommentId=17441530&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17441530 Can be wrong, but perhaps related to the "md" => "me" SSTable format upgrade with 3.11.11+. > Warning at startup in 3.11.11 or above version of Cassandra > --- > > Key: CASSANDRA-18169 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18169 > Project: Cassandra > Issue Type: Bug > Components: Local/Commit Log >Reporter: Mohammad Aburadeh >Assignee: Jacek Lewandowski >Priority: Normal > Fix For: 3.11.15 > > > We are seeing the following warning in Cassandra 3.11.11/14 at startup : > {code:java} > WARN [main] 2022-12-27 16:41:28,016 CommitLogReplayer.java:253 - Origin of 2 > sstables is unknown or doesn't match the local node; commitLogIntervals for > them were ignored > DEBUG [main] 2022-12-27 16:41:28,016 CommitLogReplayer.java:254 - Ignored > commitLogIntervals from the following sstables: > [/data/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/me-65-big-Data.db, > > /data/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/me-64-big-Data.db] > {code} > It looks like HostID metadata is missing at startup in the system.local > table. > We noticed that this issue does not exist in the 4.0.X version of Cassandra. > Could you please fix it in 3.11.X Cassandra? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-18169) Warning at startup in 3.11.11 or above version of Cassandra
[ https://issues.apache.org/jira/browse/CASSANDRA-18169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17677731#comment-17677731 ] Thomas Steinmaurer commented on CASSANDRA-18169: We have seen that in the past as well, where this WARN log produced a bit of confusion after upgrading to 3.11.11+. https://issues.apache.org/jira/browse/CASSANDRA-16619?focusedCommentId=17441530&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17441530 Can be wrong, but perhaps related to the "md" => "me" SSTable format upgrade with 3.11.11+. > Warning at startup in 3.11.11 or above version of Cassandra > --- > > Key: CASSANDRA-18169 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18169 > Project: Cassandra > Issue Type: Bug > Components: Local/Commit Log >Reporter: Mohammad Aburadeh >Assignee: Jacek Lewandowski >Priority: Normal > Fix For: 3.11.15 > > > We are seeing the following warning in Cassandra 3.11.11/14 at startup : > {code:java} > WARN [main] 2022-12-27 16:41:28,016 CommitLogReplayer.java:253 - Origin of 2 > sstables is unknown or doesn't match the local node; commitLogIntervals for > them were ignored > DEBUG [main] 2022-12-27 16:41:28,016 CommitLogReplayer.java:254 - Ignored > commitLogIntervals from the following sstables: > [/data/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/me-65-big-Data.db, > > /data/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/me-64-big-Data.db] > {code} > It looks like HostID metadata is missing at startup in the system.local > table. > We noticed that this issue does not exist in the 4.0.X version of Cassandra. > Could you please fix it in 3.11.X Cassandra? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-16555) Add out-of-the-box snitch for Ec2 IMDSv2
[ https://issues.apache.org/jira/browse/CASSANDRA-16555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648566#comment-17648566 ] Thomas Steinmaurer edited comment on CASSANDRA-16555 at 12/16/22 10:55 AM: --- I wonder if the dedicated/separate snitch implementation is the best way to move forward, if I may challenge that :). Perhaps would it make more sense to extend the existing {{Ec2Snitch}} implementation to make it configurable for being used to IMDSv2 or perhaps even smarter in a way, that it first automatically detects what is available on the EC2 instance and then simply uses that behind the scene? was (Author: tsteinmaurer): I wonder if the dedicated/separate is the best way to move forward, if I may challenge that :). Perhaps would it make more sense to extend the existing \{{Ec2Snitch}} implementation to make it configurable for being used to IMDSv2 or perhaps even smarter in a way, that it first automatically detects what is available on the EC2 instance and then simply uses that behind the scene? > Add out-of-the-box snitch for Ec2 IMDSv2 > > > Key: CASSANDRA-16555 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16555 > Project: Cassandra > Issue Type: New Feature > Components: Consistency/Coordination >Reporter: Paul Rütter (BlueConic) >Assignee: fulco taen >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0.x, 4.1.x, 4.x > > Time Spent: 1.5h > Remaining Estimate: 0h > > In order to patch a vulnerability, Amazon came up with a new version of their > metadata service. > It's no longer unrestricted but now requires a token (in a header), in order > to access the metadata service. > See > [https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html] > for more information. > Cassandra currently doesn't offer an out-of-the-box snitch class to support > this. > See > [https://cassandra.apache.org/doc/latest/operating/snitch.html#snitch-classes] > This issue asks to add support for this as a separate snitch class. > We'll probably do a PR for this, as we are in the process of developing one. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16555) Add out-of-the-box snitch for Ec2 IMDSv2
[ https://issues.apache.org/jira/browse/CASSANDRA-16555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648566#comment-17648566 ] Thomas Steinmaurer commented on CASSANDRA-16555: I wonder if the dedicated/separate is the best way to move forward, if I may challenge that :). Perhaps would it make more sense to extend the existing \{{Ec2Snitch}} implementation to make it configurable for being used to IMDSv2 or perhaps even smarter in a way, that it first automatically detects what is available on the EC2 instance and then simply uses that behind the scene? > Add out-of-the-box snitch for Ec2 IMDSv2 > > > Key: CASSANDRA-16555 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16555 > Project: Cassandra > Issue Type: New Feature > Components: Consistency/Coordination >Reporter: Paul Rütter (BlueConic) >Assignee: fulco taen >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0.x, 4.1.x, 4.x > > Time Spent: 1.5h > Remaining Estimate: 0h > > In order to patch a vulnerability, Amazon came up with a new version of their > metadata service. > It's no longer unrestricted but now requires a token (in a header), in order > to access the metadata service. > See > [https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html] > for more information. > Cassandra currently doesn't offer an out-of-the-box snitch class to support > this. > See > [https://cassandra.apache.org/doc/latest/operating/snitch.html#snitch-classes] > This issue asks to add support for this as a separate snitch class. > We'll probably do a PR for this, as we are in the process of developing one. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-9881) Rows with negative-sized keys can't be skipped by sstablescrub
[ https://issues.apache.org/jira/browse/CASSANDRA-9881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17647675#comment-17647675 ] Thomas Steinmaurer edited comment on CASSANDRA-9881 at 12/15/22 6:12 AM: - Interesting, I'm also getting this with 3.11.14 while trying to scrub a single Cassandra table, where it seems that a single physical SSTable on disk is broken. Seems to be in an infinite loop with the same log line as shown above. No progress according to "nodetool compactionstats" etc ... {noformat} WARN [CompactionExecutor:3252] 2022-12-14 19:29:32,206 UTC OutputHandler.java:57 - Error reading partition (unreadable key) (stacktrace follows): java.io.IOError: java.io.IOException: Unable to read partition key from data file at org.apache.cassandra.db.compaction.Scrubber.scrub(Scrubber.java:222) at org.apache.cassandra.db.compaction.CompactionManager.scrubOne(CompactionManager.java:1052) at org.apache.cassandra.db.compaction.CompactionManager.access$200(CompactionManager.java:86) at org.apache.cassandra.db.compaction.CompactionManager$3.execute(CompactionManager.java:399) at org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:319) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) at java.lang.Thread.run(Thread.java:750) Caused by: java.io.IOException: Unable to read partition key from data file {noformat} was (Author: tsteinmaurer): Interesting, I'm also getting this with 3.11.14 while trying to scrub a single Cassandra table, where it seems that a single physical SSTable on disk is broken. Seems to be in an infinite loop with the same log line as shown above. No progress according to "nodetool compactionstats" etc ... > Rows with negative-sized keys can't be skipped by sstablescrub > -- > > Key: CASSANDRA-9881 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9881 > Project: Cassandra > Issue Type: Bug >Reporter: Brandon Williams >Priority: Low > Fix For: 2.1.x > > > It is possible to have corruption in such a way that scrub (on or offline) > can't skip the row, so you end up in a loop where this just keeps repeating: > {noformat} > WARNING: Row starting at position 2087453 is unreadable; skipping to next > Reading row at 2087453 > row (unreadable key) is -1 bytes > {noformat} > The workaround is to just delete the problem sstable since you were going to > have to repair anyway, but it would still be nice to salvage the rest of the > sstable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-9881) Rows with negative-sized keys can't be skipped by sstablescrub
[ https://issues.apache.org/jira/browse/CASSANDRA-9881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17647675#comment-17647675 ] Thomas Steinmaurer commented on CASSANDRA-9881: --- Interesting, I'm also getting this with 3.11.14 while trying to scrub a single Cassandra table, where it seems that a single physical SSTable on disk is broken. Seems to be in an infinite loop with the same log line as shown above. No progress according to "nodetool compactionstats" etc ... > Rows with negative-sized keys can't be skipped by sstablescrub > -- > > Key: CASSANDRA-9881 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9881 > Project: Cassandra > Issue Type: Bug >Reporter: Brandon Williams >Priority: Low > Fix For: 2.1.x > > > It is possible to have corruption in such a way that scrub (on or offline) > can't skip the row, so you end up in a loop where this just keeps repeating: > {noformat} > WARNING: Row starting at position 2087453 is unreadable; skipping to next > Reading row at 2087453 > row (unreadable key) is -1 bytes > {noformat} > The workaround is to just delete the problem sstable since you were going to > have to repair anyway, but it would still be nice to salvage the rest of the > sstable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16555) Add out-of-the-box snitch for Ec2 IMDSv2
[ https://issues.apache.org/jira/browse/CASSANDRA-16555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17644718#comment-17644718 ] Thomas Steinmaurer commented on CASSANDRA-16555: [~brandon.williams] many thanks for picking this up! As there are PRs available now, how realistic would it be that this goes into the not yet released 3.11.15? Again, thanks a lot! > Add out-of-the-box snitch for Ec2 IMDSv2 > > > Key: CASSANDRA-16555 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16555 > Project: Cassandra > Issue Type: New Feature > Components: Consistency/Coordination >Reporter: Paul Rütter (BlueConic) >Assignee: fulco taen >Priority: Normal > Time Spent: 1.5h > Remaining Estimate: 0h > > In order to patch a vulnerability, Amazon came up with a new version of their > metadata service. > It's no longer unrestricted but now requires a token (in a header), in order > to access the metadata service. > See > [https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html] > for more information. > Cassandra currently doesn't offer an out-of-the-box snitch class to support > this. > See > [https://cassandra.apache.org/doc/latest/operating/snitch.html#snitch-classes] > This issue asks to add support for this as a separate snitch class. > We'll probably do a PR for this, as we are in the process of developing one. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-17840) IndexOutOfBoundsException in Paging State Version Inference (V3 State Received on V4 Connection)
[ https://issues.apache.org/jira/browse/CASSANDRA-17840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600581#comment-17600581 ] Thomas Steinmaurer commented on CASSANDRA-17840: Any chance this is similar to, and also fixes, CASSANDRA-17507? > IndexOutOfBoundsException in Paging State Version Inference (V3 State > Received on V4 Connection) > > > Key: CASSANDRA-17840 > URL: https://issues.apache.org/jira/browse/CASSANDRA-17840 > Project: Cassandra > Issue Type: Bug > Components: Messaging/Client >Reporter: Josh McKenzie >Assignee: Josh McKenzie >Priority: Normal > Fix For: 3.11.14, 4.0.6, 4.1, 4.2 > > > In {{PagingState.java}}, {{index}} is an integer field, and we add long > values to it without a {{Math.toIntExact}} check. While we’re checking for > negative values returned by {{getUnsignedVInt}}, there's a chance that > the value returned by it is so large that the addition operation would cause > integer overflow, or the value itself is large enough to cause overflow. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
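To make the quoted description concrete: the hazard is an int cursor advanced by a long decoded from the wire, where a mismatched or corrupt paging state can carry a huge length. The sketch below is illustrative only; the class, field and method names are invented and do not mirror the actual PagingState code.

{code:java}
// Illustrative sketch of the overflow pattern described in CASSANDRA-17840.
final class Cursor
{
    private int index;

    void advanceUnsafe(long decodedLength)
    {
        index += decodedLength;               // high bits silently dropped
    }

    void advanceChecked(long decodedLength)
    {
        // Math.toIntExact / Math.addExact throw ArithmeticException instead
        // of wrapping, turning a bad paging state into a clean error.
        index = Math.addExact(index, Math.toIntExact(decodedLength));
    }

    int position()
    {
        return index;
    }

    public static void main(String[] args)
    {
        Cursor c = new Cursor();
        c.advanceChecked(42L);                  // fine: fits in an int
        c.advanceUnsafe((1L << 40) + 123);      // huge length truncated to 123
        System.out.println(c.position());       // prints 165, not ~2^40
        c.advanceChecked(1L << 40);             // throws ArithmeticException
    }
}
{code}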
[jira] [Commented] (CASSANDRA-17507) IllegalArgumentException in query code path during 3.11.12 => 4.0.3 rolling upgrade
[ https://issues.apache.org/jira/browse/CASSANDRA-17507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516623#comment-17516623 ] Thomas Steinmaurer commented on CASSANDRA-17507: According to [https://the-asf.slack.com/archives/CJZLTM05A/p1648727883515419,] not known, possibly a bug causing queries to fail during the rolling upgrade, thus I have opened this ticket. > IllegalArgumentException in query code path during 3.11.12 => 4.0.3 rolling > upgrade > --- > > Key: CASSANDRA-17507 > URL: https://issues.apache.org/jira/browse/CASSANDRA-17507 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Coordination >Reporter: Thomas Steinmaurer >Priority: Normal > Fix For: 4.0.x > > > In a 6 node 3.11.12 test cluster - freshly set up, thus no legacy SSTables > etc. - with ~ 1TB SSTables on disk per node, I have been running a rolling > upgrade to 4.0.3. On upgraded 4.0.3 nodes I then have seen the following > exception regularly, which disappeared once all 6 nodes have been on 4.0.3. > Is this known? Can this be ignored? As said, just a test drive, but not sure > if we want to have that in production, especially with a larger number of > nodes, where it could take some time, until all are upgraded. Thanks! > {code} > ERROR [Native-Transport-Requests-8] 2022-03-30 11:30:24,057 > ErrorMessage.java:457 - Unexpected exception during request > java.lang.IllegalArgumentException: newLimit > capacity: (290 > 15) > at java.base/java.nio.Buffer.createLimitException(Buffer.java:372) > at java.base/java.nio.Buffer.limit(Buffer.java:346) > at java.base/java.nio.ByteBuffer.limit(ByteBuffer.java:1107) > at java.base/java.nio.ByteBuffer.limit(ByteBuffer.java:262) > at > org.apache.cassandra.db.marshal.ByteBufferAccessor.slice(ByteBufferAccessor.java:107) > at > org.apache.cassandra.db.marshal.ByteBufferAccessor.slice(ByteBufferAccessor.java:39) > at > org.apache.cassandra.db.marshal.ValueAccessor.sliceWithShortLength(ValueAccessor.java:225) > at > org.apache.cassandra.db.marshal.CompositeType.splitName(CompositeType.java:222) > at > org.apache.cassandra.service.pager.PagingState$RowMark.decodeClustering(PagingState.java:434) > at > org.apache.cassandra.service.pager.PagingState$RowMark.clustering(PagingState.java:388) > at > org.apache.cassandra.service.pager.SinglePartitionPager.nextPageReadQuery(SinglePartitionPager.java:88) > at > org.apache.cassandra.service.pager.SinglePartitionPager.nextPageReadQuery(SinglePartitionPager.java:32) > at > org.apache.cassandra.service.pager.AbstractQueryPager.fetchPage(AbstractQueryPager.java:69) > at > org.apache.cassandra.service.pager.SinglePartitionPager.fetchPage(SinglePartitionPager.java:32) > at > org.apache.cassandra.cql3.statements.SelectStatement$Pager$NormalPager.fetchPage(SelectStatement.java:352) > at > org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:400) > at > org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:250) > at > org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:88) > at > org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:244) > at > org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:723) > at > org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:701) > at > org.apache.cassandra.transport.messages.ExecuteMessage.execute(ExecuteMessage.java:159) > at > org.apache.cassandra.transport.Message$Request.execute(Message.java:242) > at 
> org.apache.cassandra.transport.Dispatcher.processRequest(Dispatcher.java:86) > at > org.apache.cassandra.transport.Dispatcher.processRequest(Dispatcher.java:106) > at > org.apache.cassandra.transport.Dispatcher.lambda$dispatch$0(Dispatcher.java:70) > at > java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) > at > org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:165) > at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.base/java.lang.Thread.run(Thread.java:829) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-17507) IllegalArgumentException in query code path during 3.11.12 => 4.0.3 rolling upgrade
Thomas Steinmaurer created CASSANDRA-17507: -- Summary: IllegalArgumentException in query code path during 3.11.12 => 4.0.3 rolling upgrade Key: CASSANDRA-17507 URL: https://issues.apache.org/jira/browse/CASSANDRA-17507 Project: Cassandra Issue Type: Bug Reporter: Thomas Steinmaurer In a 6 node 3.11.12 test cluster - freshly set up, thus no legacy SSTables etc. - with ~ 1TB SSTables on disk per node, I have been running a rolling upgrade to 4.0.3. On upgraded 4.0.3 nodes I then have seen the following exception regularly, which disappeared once all 6 nodes have been on 4.0.3. Is this known? Can this be ignored? As said, just a test drive, but not sure if we want to have that in production, especially with a larger number of nodes, where it could take some time, until all are upgraded. Thanks! {code} ERROR [Native-Transport-Requests-8] 2022-03-30 11:30:24,057 ErrorMessage.java:457 - Unexpected exception during request java.lang.IllegalArgumentException: newLimit > capacity: (290 > 15) at java.base/java.nio.Buffer.createLimitException(Buffer.java:372) at java.base/java.nio.Buffer.limit(Buffer.java:346) at java.base/java.nio.ByteBuffer.limit(ByteBuffer.java:1107) at java.base/java.nio.ByteBuffer.limit(ByteBuffer.java:262) at org.apache.cassandra.db.marshal.ByteBufferAccessor.slice(ByteBufferAccessor.java:107) at org.apache.cassandra.db.marshal.ByteBufferAccessor.slice(ByteBufferAccessor.java:39) at org.apache.cassandra.db.marshal.ValueAccessor.sliceWithShortLength(ValueAccessor.java:225) at org.apache.cassandra.db.marshal.CompositeType.splitName(CompositeType.java:222) at org.apache.cassandra.service.pager.PagingState$RowMark.decodeClustering(PagingState.java:434) at org.apache.cassandra.service.pager.PagingState$RowMark.clustering(PagingState.java:388) at org.apache.cassandra.service.pager.SinglePartitionPager.nextPageReadQuery(SinglePartitionPager.java:88) at org.apache.cassandra.service.pager.SinglePartitionPager.nextPageReadQuery(SinglePartitionPager.java:32) at org.apache.cassandra.service.pager.AbstractQueryPager.fetchPage(AbstractQueryPager.java:69) at org.apache.cassandra.service.pager.SinglePartitionPager.fetchPage(SinglePartitionPager.java:32) at org.apache.cassandra.cql3.statements.SelectStatement$Pager$NormalPager.fetchPage(SelectStatement.java:352) at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:400) at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:250) at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:88) at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:244) at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:723) at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:701) at org.apache.cassandra.transport.messages.ExecuteMessage.execute(ExecuteMessage.java:159) at org.apache.cassandra.transport.Message$Request.execute(Message.java:242) at org.apache.cassandra.transport.Dispatcher.processRequest(Dispatcher.java:86) at org.apache.cassandra.transport.Dispatcher.processRequest(Dispatcher.java:106) at org.apache.cassandra.transport.Dispatcher.lambda$dispatch$0(Dispatcher.java:70) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:165) at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119) at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.base/java.lang.Thread.run(Thread.java:829) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-17204) Upgrade to Logback 1.2.8 (security)
[ https://issues.apache.org/jira/browse/CASSANDRA-17204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17463855#comment-17463855 ] Thomas Steinmaurer commented on CASSANDRA-17204: Should 1.2.9 perhaps be used? > Upgrade to Logback 1.2.8 (security) > --- > > Key: CASSANDRA-17204 > URL: https://issues.apache.org/jira/browse/CASSANDRA-17204 > Project: Cassandra > Issue Type: Improvement > Components: Dependencies >Reporter: Jochen Schalanda >Assignee: Brandon Williams >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0.x, 4.x > > > Logback 1.2.8 has been released with a fix for a potential vulnerability in > its JNDI lookup. > * [http://logback.qos.ch/news.html] > * [https://jira.qos.ch/browse/LOGBACK-1591] > {quote}*14th of December, 2021, Release of version 1.2.8* > We note that the vulnerability mentioned in LOGBACK-1591 requires write > access to logback's configuration file as a prerequisite. > * • In response to LOGBACK-1591, we have disabled all JNDI lookup code in > logback until further notice. This impacts {{ContextJNDISelector}} and > {{}} element in configuration files. > * Also in response to LOGBACK-1591, we have removed all database (JDBC) > related code in the project with no replacement. > We note that the vulnerability mentioned in LOGBACK-1591 requires write > access to logback's configuration file as a prerequisite. A successful RCE > requires all of the following to be true: > * write access to logback.xml > * use of versions < 1.2.8 > * reloading of poisoned configuration data, which implies application restart > or scan="true" set prior to attack > Therefore and as an additional precaution, in addition to upgrading to > version 1.2.8, we also recommend users to set their logback configuration > files as read-only. > {quote} > This is not as bad as CVE-2021-44228 in Log4j <2.15.0 (Log4Shell), but should > probably be fixed anyway. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16619) Loss of commit log data possible after sstable ingest
[ https://issues.apache.org/jira/browse/CASSANDRA-16619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441530#comment-17441530 ] Thomas Steinmaurer commented on CASSANDRA-16619: Regarding the WARN log, which got introduced by that ticket, e.g.: {noformat} WARN [main] 2021-11-08 21:54:06,826 CommitLogReplayer.java:253 - Origin of 1 sstables is unknown or doesn't match the local node; commitLogIntervals for them were ignored {noformat} While I understand the intention to ensure / avoid things when SSTables have been copied around (or e.g. due to a restore), the WARN log also seems to happen when Cassandra 3.11.11 reads pre-"*me*" SSTables, thus e.g. from 3.11.10. I understand that the WARN log will go away eventually on its own resp. for sure (I guess?) after running "nodetool upgradesstables". These sort of WARN log has produced quite some confusion and customer interaction for on-premise customer installations. * Would it be possible to WARN only if we are in context of a "me" SSTable to avoid confusion after upgrading from pre-3.11.11? * Would it be possible to mention a SSTable minor upgrade in e.g. {{NEWS.txt}} (or perhaps I missed it), as there might be tooling out there which counts number of SSTables per "format" via file name Many thanks. > Loss of commit log data possible after sstable ingest > - > > Key: CASSANDRA-16619 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16619 > Project: Cassandra > Issue Type: Bug > Components: Local/Commit Log >Reporter: Jacek Lewandowski >Assignee: Jacek Lewandowski >Priority: Normal > Fix For: 3.0.25, 3.11.11, 4.0-rc2, 4.0 > > Time Spent: 2h > Remaining Estimate: 0h > > SSTable metadata contains commit log positions of the sstable. These > positions are used to filter out mutations from the commit log on restart and > only make sense for the node on which the data was flushed. > If an SSTable is moved between nodes they may cover regions that the > receiving node has not yet flushed, and result in valid data being lost > should these sections of the commit log need to be replayed. > Solution: > The chosen solution introduces a new sstable metadata (StatsMetadata) - > originatingHostId (UUID), which is the local host id of the node on which the > sstable was created, or null if not known. Commit log intervals from an > sstable are taken into account during Commit Log replay only when the > originatingHostId of the sstable matches the local node's hostId. > For new sstables the originatingHostId is set according to StorageService's > local hostId. > For compacted sstables the originatingHostId set according to > StorageService's local hostId, and only commit log intervals from local > sstables is preserved in the resulting sstable. > discovered by [~jakubzytka] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
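A conceptual sketch of the rule this ticket introduces, which is also what produces the WARN quoted above, is shown below. The types and names are invented for illustration; the real logic lives in CommitLogReplayer and the sstable StatsMetadata.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Conceptual sketch only: commit log intervals recorded in an sstable are
// trusted during replay only when the sstable's originatingHostId equals the
// local node's host id. For pre-"me" sstables (null origin) or sstables
// copied from another node the intervals are ignored, which is what the
// startup WARN reports.
final class ReplayIntervalFilter
{
    static final class SSTableOrigin
    {
        final UUID originatingHostId;          // null for sstables written before the "me" format
        final List<long[]> commitLogIntervals; // [start, end] positions covered by the sstable

        SSTableOrigin(UUID originatingHostId, List<long[]> commitLogIntervals)
        {
            this.originatingHostId = originatingHostId;
            this.commitLogIntervals = commitLogIntervals;
        }
    }

    static List<long[]> trustedIntervals(UUID localHostId, List<SSTableOrigin> sstables)
    {
        List<long[]> trusted = new ArrayList<>();
        int ignored = 0;
        for (SSTableOrigin s : sstables)
        {
            if (localHostId.equals(s.originatingHostId))
                trusted.addAll(s.commitLogIntervals);
            else
                ignored++;
        }
        if (ignored > 0)
            System.out.printf("Origin of %d sstables is unknown or doesn't match the local node; " +
                              "commitLogIntervals for them were ignored%n", ignored);
        return trusted;
    }
}
{code}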
[jira] [Commented] (CASSANDRA-11418) Nodetool status should reflect hibernate/replacing states
[ https://issues.apache.org/jira/browse/CASSANDRA-11418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17392008#comment-17392008 ] Thomas Steinmaurer commented on CASSANDRA-11418: As 4.0 has been released, is this something which could be picked up in the near future, perhaps even for the 3.11 series? Reason is that showing a node in normal state when using replace_address is not only confusing for operators, but especially for any automation/monitoring tooling behind a Cassandra cluster. Additionally, when a Cassandra process disables Gossip (and client protocols) due to disk issues, other nodes will see it as DN but running nodetool on this particular node will report UN, although Gossip is disabled. This is additionally confusing for any automation/monitoring tooling. > Nodetool status should reflect hibernate/replacing states > - > > Key: CASSANDRA-11418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11418 > Project: Cassandra > Issue Type: Improvement > Components: Legacy/Observability, Tool/nodetool >Reporter: Joel Knighton >Assignee: Shaurya Gupta >Priority: Low > Labels: lhf > Fix For: 4.x > > Attachments: cassandra-11418-trunk > > > Currently, the four options for state in nodetool status are > joining/leaving/moving/normal. > Joining nodes are determined based on bootstrap tokens, leaving nodes are > based on leaving endpoints in TokenMetadata, moving nodes are based on moving > endpoints in TokenMetadata. > This means that a node will appear in normal state when going through a > bootstrap with flag replace_address, which can be confusing to operators. > We should add another state for hibernation/replacing to make this visible. > This will require a way to get a list of all hibernating endpoints. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-11418) Nodetool status should reflect hibernate/replacing states
[ https://issues.apache.org/jira/browse/CASSANDRA-11418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer reassigned CASSANDRA-11418: -- Assignee: Shaurya Gupta (was: Thomas Steinmaurer) > Nodetool status should reflect hibernate/replacing states > - > > Key: CASSANDRA-11418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11418 > Project: Cassandra > Issue Type: Improvement > Components: Legacy/Observability, Tool/nodetool >Reporter: Joel Knighton >Assignee: Shaurya Gupta >Priority: Low > Labels: lhf > Fix For: 4.x > > Attachments: cassandra-11418-trunk > > > Currently, the four options for state in nodetool status are > joining/leaving/moving/normal. > Joining nodes are determined based on bootstrap tokens, leaving nodes are > based on leaving endpoints in TokenMetadata, moving nodes are based on moving > endpoints in TokenMetadata. > This means that a node will appear in normal state when going through a > bootstrap with flag replace_address, which can be confusing to operators. > We should add another state for hibernation/replacing to make this visible. > This will require a way to get a list of all hibernating endpoints. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-11418) Nodetool status should reflect hibernate/replacing states
[ https://issues.apache.org/jira/browse/CASSANDRA-11418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-11418: --- Authors: Shaurya Gupta (was: Thomas Steinmaurer) > Nodetool status should reflect hibernate/replacing states > - > > Key: CASSANDRA-11418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11418 > Project: Cassandra > Issue Type: Improvement > Components: Legacy/Observability, Tool/nodetool >Reporter: Joel Knighton >Assignee: Thomas Steinmaurer >Priority: Low > Labels: lhf > Fix For: 4.x > > Attachments: cassandra-11418-trunk > > > Currently, the four options for state in nodetool status are > joining/leaving/moving/normal. > Joining nodes are determined based on bootstrap tokens, leaving nodes are > based on leaving endpoints in TokenMetadata, moving nodes are based on moving > endpoints in TokenMetadata. > This means that a node will appear in normal state when going through a > bootstrap with flag replace_address, which can be confusing to operators. > We should add another state for hibernation/replacing to make this visible. > This will require a way to get a list of all hibernating endpoints. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-11418) Nodetool status should reflect hibernate/replacing states
[ https://issues.apache.org/jira/browse/CASSANDRA-11418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer reassigned CASSANDRA-11418: -- Assignee: Thomas Steinmaurer (was: Shaurya Gupta) > Nodetool status should reflect hibernate/replacing states > - > > Key: CASSANDRA-11418 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11418 > Project: Cassandra > Issue Type: Improvement > Components: Legacy/Observability, Tool/nodetool >Reporter: Joel Knighton >Assignee: Thomas Steinmaurer >Priority: Low > Labels: lhf > Fix For: 4.x > > Attachments: cassandra-11418-trunk > > > Currently, the four options for state in nodetool status are > joining/leaving/moving/normal. > Joining nodes are determined based on bootstrap tokens, leaving nodes are > based on leaving endpoints in TokenMetadata, moving nodes are based on moving > endpoints in TokenMetadata. > This means that a node will appear in normal state when going through a > bootstrap with flag replace_address, which can be confusing to operators. > We should add another state for hibernation/replacing to make this visible. > This will require a way to get a list of all hibernating endpoints. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16442) Improve handling of failed prepared statement loading
[ https://issues.apache.org/jira/browse/CASSANDRA-16442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-16442: --- Fix Version/s: 3.11.x > Improve handling of failed prepared statement loading > - > > Key: CASSANDRA-16442 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16442 > Project: Cassandra > Issue Type: Improvement >Reporter: Thomas Steinmaurer >Priority: Normal > Fix For: 3.11.x > > > In an internal DEV cluster, when going from 3.0 to 3.11 we have seen the > following WARN logs constantly upon Cassandra startup. > {noformat} > ... > WARN [main] 2021-02-05 09:25:06,892 QueryProcessor.java:160 - prepared > statement recreation error: SELECT n,v FROM "Ts2Volatile60Min" WHERE k=? > LIMIT ?; > WARN [main] 2021-02-05 09:25:06,895 QueryProcessor.java:160 - prepared > statement recreation error: INSERT INTO "Ts2Final01Min" (k,n,v) VALUES > (?,?,?) USING TIMESTAMP ?; > ... > {noformat} > I guess 3.11 tries to pre-load prepared statements for tables which don't > exist anymore. On how we got into this situation was our fault I think (Cas > 3.0 => Upgrade 3.11 => Downgrade 3.0 => with 3.0 some tables got dropped => > Upgrade 3.11.10). > Still, perhaps there is room for improvement when it comes to loading > persisted prepared statements, which might fail. > I thought about: > * An additional {{nodetool}} option to wipe the persisted prepared statement > cache > * Perhaps even make the startup code smarter in a way, when loading of a > prepared statement fails, due to a table not being available anymore, then > auto-wipe such entries from the {{prepared_statements}} system table > To get rid of the WARN log, I currently need to work directly on the > "prepared_statements" system table, but I don't know if it is safe to run > e.g. a TRUNCATE statement, thus currently, it seems we need to take each node > offline, execute a Linux {{rm}} command on SSTables for the system table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-16442) Improve handling of failed prepared statement loading
Thomas Steinmaurer created CASSANDRA-16442: -- Summary: Improve handling of failed prepared statement loading Key: CASSANDRA-16442 URL: https://issues.apache.org/jira/browse/CASSANDRA-16442 Project: Cassandra Issue Type: Improvement Reporter: Thomas Steinmaurer In an internal DEV cluster, when going from 3.0 to 3.11 we have seen the following WARN logs constantly upon Cassandra startup. {noformat} ... WARN [main] 2021-02-05 09:25:06,892 QueryProcessor.java:160 - prepared statement recreation error: SELECT n,v FROM "Ts2Volatile60Min" WHERE k=? LIMIT ?; WARN [main] 2021-02-05 09:25:06,895 QueryProcessor.java:160 - prepared statement recreation error: INSERT INTO "Ts2Final01Min" (k,n,v) VALUES (?,?,?) USING TIMESTAMP ?; ... {noformat} I guess 3.11 tries to pre-load prepared statements for tables which don't exist anymore. How we got into this situation was our own fault, I think (Cas 3.0 => Upgrade 3.11 => Downgrade 3.0 => with 3.0 some tables got dropped => Upgrade 3.11.10). Still, perhaps there is room for improvement when it comes to loading persisted prepared statements, which might fail. I thought about: * An additional {{nodetool}} option to wipe the persisted prepared statement cache * Perhaps even make the startup code smarter: when loading a prepared statement fails because a table is no longer available, auto-wipe such entries from the {{prepared_statements}} system table To get rid of the WARN log, I currently need to work directly on the "prepared_statements" system table, but I don't know if it is safe to run e.g. a TRUNCATE statement, so currently it seems we need to take each node offline and execute a Linux {{rm}} command on the SSTables of that system table. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
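A hedged sketch of the second idea above (auto-wiping entries whose recreation fails). The helper interfaces are invented for illustration; this is not the actual {{QueryProcessor}} preload code:

{noformat}
import java.util.Map;

// Sketch: on startup, try to re-prepare each persisted statement; if recreation
// fails (e.g. because the table was dropped), delete the stale row from the
// prepared_statements system table instead of only logging a WARN on every start.
final class PreparedStatementPreloader
{
    interface Recreator { void recreate(String query) throws Exception; }
    interface SystemTableCleaner { void delete(Object cacheKey); }

    static int preload(Map<Object, String> persisted, Recreator recreator, SystemTableCleaner cleaner)
    {
        int failed = 0;
        for (Map.Entry<Object, String> e : persisted.entrySet())
        {
            try
            {
                recreator.recreate(e.getValue());
            }
            catch (Exception ex)
            {
                cleaner.delete(e.getKey()); // auto-wipe instead of leaving the entry behind
                failed++;
            }
        }
        return failed;
    }
}
{noformat}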
[jira] [Commented] (CASSANDRA-16201) Reduce amount of allocations during batch statement execution
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264975#comment-17264975 ] Thomas Steinmaurer commented on CASSANDRA-16201: Any ideas if this will make it into 3.0.24? > Reduce amount of allocations during batch statement execution > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Thomas Steinmaurer >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-beta > > Attachments: 16201_jfr_3023_alloc.png, 16201_jfr_3023_obj.png, > 16201_jfr_3118_alloc.png, 16201_jfr_3118_obj.png, 16201_jfr_40b3_alloc.png, > 16201_jfr_40b3_obj.png, screenshot-1.png, screenshot-2.png, screenshot-3.png, > screenshot-4.png > > > In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, we have > multiple NTR threads in a 3-digit MB range. > This is likely related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. > !screenshot-1.png|width=100%! > So it seems we have many, many 20K elemnts pre-allocated object arrays > resulting in a shallow heap of 80K each, although there is only one element > in the array. > This sort of pre-allocation is causing a lot of memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-14701) Cleanup (and other) compaction type(s) not counted in compaction remaining time
[ https://issues.apache.org/jira/browse/CASSANDRA-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-14701: --- Fix Version/s: 3.11.x 3.0.x > Cleanup (and other) compaction type(s) not counted in compaction remaining > time > --- > > Key: CASSANDRA-14701 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14701 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Observability >Reporter: Thomas Steinmaurer >Priority: Normal > Fix For: 3.0.x, 3.11.x > > > Opened a ticket, as discussed in user list. > Looks like compaction remaining time only includes compactions of type > COMPACTION and other compaction types like cleanup etc. aren't part of the > estimation calculation. > E.g. from one of our environments: > {noformat} > nodetool compactionstats -H > pending tasks: 1 >compaction type keyspace table completed totalunit > progress >CleanupXXX YYY 908.16 GB 1.13 TB bytes > 78.63% > Active compaction remaining time : 0h00m00s > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
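For illustration, a small Java sketch of the estimation gap described above. Types and names are invented for the sketch and do not mirror the real {{CompactionManager}} / nodetool code:

{noformat}
import java.util.List;

// Sketch only: if "Active compaction remaining time" sums bytes of tasks whose
// type is COMPACTION, a running CLEANUP contributes nothing to the estimate.
final class RemainingTimeEstimator
{
    enum OperationType { COMPACTION, CLEANUP, SCRUB, VALIDATION }

    static final class Task
    {
        final OperationType type;
        final long completedBytes;
        final long totalBytes;

        Task(OperationType type, long completedBytes, long totalBytes)
        {
            this.type = type;
            this.completedBytes = completedBytes;
            this.totalBytes = totalBytes;
        }
    }

    // includeAllTypes = false mimics the reported behaviour (only COMPACTION counts);
    // includeAllTypes = true is the obvious alternative: count cleanup/scrub etc. too.
    static long remainingSeconds(List<Task> tasks, long throughputBytesPerSec, boolean includeAllTypes)
    {
        long remainingBytes = 0;
        for (Task t : tasks)
            if (includeAllTypes || t.type == OperationType.COMPACTION)
                remainingBytes += t.totalBytes - t.completedBytes;
        return throughputBytesPerSec <= 0 ? 0 : remainingBytes / throughputBytesPerSec;
    }
}
{noformat}

With {{includeAllTypes == false}} the pending CLEANUP above contributes nothing to the estimate, which is why the output can show {{0h00m00s}} while the cleanup is only ~79% complete.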
[jira] [Updated] (CASSANDRA-14709) Global configuration parameter to reject repairs with anti-compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-14709: --- Description: We have moved from Cassandra 2.1 to 3.0 and from an operational aspect, the Cassandra repair area changed significantly / got more complex. Beside incremental repairs not working reliably, also full repairs (-full command-line option) are running into anti-compaction code paths, splitting repaired / non-repaired data into separate SSTables, even with full repairs. Casandra 4.x (with repair enhancements) is quite away for us (for production usage), thus we want to avoid anti-compactions with Cassandra 3.x at any cost. Especially for our on-premise installations at our customer sites, with less control over on how e.g. nodetool is used, we simply want to have a configuration parameter in e.g. cassandra.yaml, which we could use to reject any repair invocations that results in anti-compaction being active. I know, such a flag still can be flipped then (by the customer), but as a first safety stage possibly sufficient enough to reject anti-compaction repairs, e.g. if someone executes nodetool repair ... the wrong way (accidentally). was: We have moved from Cassandra 2.1 to 3.0 and from an operational aspect, the Cassandra repair area changed significantly / got more complex. Beside incremental repairs not working reliably, also full repairs (-full command-line option) are running into anti-compaction code paths, splitting repaired / non-repaired data into separate SSTables, even with full repairs. Casandra 4.x (with repair enhancements) is quite away for us (for production usage), thus we want to avoid anti-compactions with Cassandra 3.x at any cost. Especially for our on-premise installations at our customer sites, with less control over on how e.g. nodetool is used, we simply want to have a configuration parameter in e.g. cassandra.yaml, which we could use to reject any repair invocations that results in anti-compaction being active. I know, such a flag still can be flipped then (by the customer), but as a first safety stage possibly sufficient enough to reject anti-compaction repairs. > Global configuration parameter to reject repairs with anti-compaction > - > > Key: CASSANDRA-14709 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14709 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Repair, Local/Config >Reporter: Thomas Steinmaurer >Priority: Normal > Fix For: 3.0.x, 3.11.x > > > We have moved from Cassandra 2.1 to 3.0 and from an operational aspect, the > Cassandra repair area changed significantly / got more complex. Beside > incremental repairs not working reliably, also full repairs (-full > command-line option) are running into anti-compaction code paths, splitting > repaired / non-repaired data into separate SSTables, even with full repairs. > Casandra 4.x (with repair enhancements) is quite away for us (for production > usage), thus we want to avoid anti-compactions with Cassandra 3.x at any > cost. Especially for our on-premise installations at our customer sites, with > less control over on how e.g. nodetool is used, we simply want to have a > configuration parameter in e.g. cassandra.yaml, which we could use to reject > any repair invocations that results in anti-compaction being active. > I know, such a flag still can be flipped then (by the customer), but as a > first safety stage possibly sufficient enough to reject anti-compaction > repairs, e.g. 
if someone executes nodetool repair ... the wrong way > (accidentally). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
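To make the request concrete, a hedged sketch of such a guard. The flag name {{reject_repair_with_anticompaction}} is made up for illustration and does not exist in cassandra.yaml:

{noformat}
// Hypothetical guard, evaluated when a repair session is about to start.
final class RepairGuard
{
    static void checkAllowed(boolean rejectRepairWithAnticompaction, boolean sessionWouldAnticompact)
    {
        // With a (made-up) yaml flag reject_repair_with_anticompaction: true,
        // any repair invocation that would split repaired/unrepaired data via
        // anti-compaction is rejected up front instead of silently proceeding.
        if (rejectRepairWithAnticompaction && sessionWouldAnticompact)
            throw new IllegalStateException(
                "Repairs that trigger anti-compaction are rejected by configuration on this node");
    }
}
{noformat}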
[jira] [Updated] (CASSANDRA-14709) Global configuration parameter to reject repairs with anti-compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-14709: --- Fix Version/s: (was: 4.x) (was: 2.2.x) > Global configuration parameter to reject repairs with anti-compaction > - > > Key: CASSANDRA-14709 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14709 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Repair, Local/Config >Reporter: Thomas Steinmaurer >Priority: Normal > Fix For: 3.0.x, 3.11.x > > > We have moved from Cassandra 2.1 to 3.0 and from an operational aspect, the > Cassandra repair area changed significantly / got more complex. Beside > incremental repairs not working reliably, also full repairs (-full > command-line option) are running into anti-compaction code paths, splitting > repaired / non-repaired data into separate SSTables, even with full repairs. > Casandra 4.x (with repair enhancements) is quite away for us (for production > usage), thus we want to avoid anti-compactions with Cassandra 3.x at any > cost. Especially for our on-premise installations at our customer sites, with > less control over on how e.g. nodetool is used, we simply want to have a > configuration parameter in e.g. cassandra.yaml, which we could use to reject > any repair invocations that results in anti-compaction being active. > I know, such a flag still can be flipped then (by the customer), but as a > first safety stage possibly sufficient enough to reject anti-compaction > repairs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-14709) Global configuration parameter to reject repairs with anti-compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-14709: --- Description: We have moved from Cassandra 2.1 to 3.0 and from an operational aspect, the Cassandra repair area changed significantly / got more complex. Beside incremental repairs not working reliably, also full repairs (-full command-line option) are running into anti-compaction code paths, splitting repaired / non-repaired data into separate SSTables, even with full repairs. Casandra 4.x (with repair enhancements) is quite away for us (for production usage), thus we want to avoid anti-compactions with Cassandra 3.x at any cost. Especially for our on-premise installations at our customer sites, with less control over on how e.g. nodetool is used, we simply want to have a configuration parameter in e.g. cassandra.yaml, which we could use to reject any repair invocations that results in anti-compaction being active. I know, such a flag still can be flipped then (by the customer), but as a first safety stage possibly sufficient enough to reject anti-compaction repairs. was: We are running Cassandra in AWS and On-Premise at customer sites, currently 2.1 in production with 3.0/3.11 in pre-production stages including loadtest. In a migration path from 2.1 to 3.11.x, I’m afraid that at some point in time we end up in incremental repairs being enabled / ran a first time unintentionally, cause: a) A lot of online resources / examples do not use the _-full_ command-line option available since 2.2 (?) b) Our internal (support) tickets of course also state nodetool repair command without the -full option, as these examples are for 2.1 Especially for On-Premise customers (with less control than with our AWS deployments), this asks a bit for getting out-of-control once we have 3.11 out and nodetool repair being run without the -full command-line option. With troubles incremental repair are introducing and incremental being the default since 2.2 (?), what do you think about a JVM system property, cassandra.yaml setting or whatever … to basically let the cluster administrator chose if incremental repairs are allowed or not? I know, such a flag still can be flipped then (by the customer), but as a first safety stage possibly sufficient enough. > Global configuration parameter to reject repairs with anti-compaction > - > > Key: CASSANDRA-14709 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14709 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Repair, Local/Config >Reporter: Thomas Steinmaurer >Priority: Normal > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x > > > We have moved from Cassandra 2.1 to 3.0 and from an operational aspect, the > Cassandra repair area changed significantly / got more complex. Beside > incremental repairs not working reliably, also full repairs (-full > command-line option) are running into anti-compaction code paths, splitting > repaired / non-repaired data into separate SSTables, even with full repairs. > Casandra 4.x (with repair enhancements) is quite away for us (for production > usage), thus we want to avoid anti-compactions with Cassandra 3.x at any > cost. Especially for our on-premise installations at our customer sites, with > less control over on how e.g. nodetool is used, we simply want to have a > configuration parameter in e.g. cassandra.yaml, which we could use to reject > any repair invocations that results in anti-compaction being active. 
> I know, such a flag still can be flipped then (by the customer), but as a > first safety stage possibly sufficient enough to reject anti-compaction > repairs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-14709) Global configuration parameter to reject repairs with anti-compaction
[ https://issues.apache.org/jira/browse/CASSANDRA-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-14709: --- Summary: Global configuration parameter to reject repairs with anti-compaction (was: Global configuration parameter to reject increment repair and allow full repair only) > Global configuration parameter to reject repairs with anti-compaction > - > > Key: CASSANDRA-14709 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14709 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Repair, Local/Config >Reporter: Thomas Steinmaurer >Priority: Normal > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x > > > We are running Cassandra in AWS and On-Premise at customer sites, currently > 2.1 in production with 3.0/3.11 in pre-production stages including loadtest. > In a migration path from 2.1 to 3.11.x, I’m afraid that at some point in time > we end up in incremental repairs being enabled / ran a first time > unintentionally, cause: > a) A lot of online resources / examples do not use the _-full_ command-line > option available since 2.2 (?) > b) Our internal (support) tickets of course also state nodetool repair > command without the -full option, as these examples are for 2.1 > Especially for On-Premise customers (with less control than with our AWS > deployments), this asks a bit for getting out-of-control once we have 3.11 > out and nodetool repair being run without the -full command-line option. > With troubles incremental repair are introducing and incremental being the > default since 2.2 (?), what do you think about a JVM system property, > cassandra.yaml setting or whatever … to basically let the cluster > administrator chose if incremental repairs are allowed or not? I know, such a > flag still can be flipped then (by the customer), but as a first safety stage > possibly sufficient enough. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16201) Reduce amount of allocations during batch statement execution
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239087#comment-17239087 ] Thomas Steinmaurer commented on CASSANDRA-16201: Do we have an ETA for the patch being included/merged? > Reduce amount of allocations during batch statement execution > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Thomas Steinmaurer >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-beta > > Attachments: 16201_jfr_3023_alloc.png, 16201_jfr_3023_obj.png, > 16201_jfr_3118_alloc.png, 16201_jfr_3118_obj.png, 16201_jfr_40b3_alloc.png, > 16201_jfr_40b3_obj.png, screenshot-1.png, screenshot-2.png, screenshot-3.png, > screenshot-4.png > > > In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, we have > multiple NTR threads in a 3-digit MB range. > This is likely related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. > !screenshot-1.png|width=100%! > So it seems we have many, many 20K elemnts pre-allocated object arrays > resulting in a shallow heap of 80K each, although there is only one element > in the array. > This sort of pre-allocation is causing a lot of memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15563) Backport removal of OpenJDK warning log
[ https://issues.apache.org/jira/browse/CASSANDRA-15563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-15563: --- Description: As requested on ASF Slack, creating this ticket for a backport of CASSANDRA-13916 for 3.0. (was: As requested on Slack, creating this ticket for a backport of CASSANDRA-13916 for 3.0.) > Backport removal of OpenJDK warning log > --- > > Key: CASSANDRA-15563 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15563 > Project: Cassandra > Issue Type: Task >Reporter: Thomas Steinmaurer >Priority: Normal > Fix For: 3.0.x > > > As requested on ASF Slack, creating this ticket for a backport of > CASSANDRA-13916 for 3.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15563) Backport removal of OpenJDK warning log
[ https://issues.apache.org/jira/browse/CASSANDRA-15563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-15563: --- Description: As requested on Slack, creating this ticket for a backport of CASSANDRA-13916 for 3.0. (was: As requested on Slack, creating this ticket for a backport of CASSANDRA-13916, potentially to 2.2 and 3.0.) > Backport removal of OpenJDK warning log > --- > > Key: CASSANDRA-15563 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15563 > Project: Cassandra > Issue Type: Task >Reporter: Thomas Steinmaurer >Priority: Normal > Fix For: 3.0.x > > > As requested on Slack, creating this ticket for a backport of CASSANDRA-13916 > for 3.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15563) Backport removal of OpenJDK warning log
[ https://issues.apache.org/jira/browse/CASSANDRA-15563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-15563: --- Fix Version/s: (was: 2.2.x) > Backport removal of OpenJDK warning log > --- > > Key: CASSANDRA-15563 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15563 > Project: Cassandra > Issue Type: Task >Reporter: Thomas Steinmaurer >Priority: Normal > Fix For: 3.0.x > > > As requested on Slack, creating this ticket for a backport of > CASSANDRA-13916, potentially to 2.2 and 3.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-13900) Massive GC suspension increase after updating to 3.0.14 from 2.1.18
[ https://issues.apache.org/jira/browse/CASSANDRA-13900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-13900: --- Resolution: Duplicate Status: Resolved (was: Open) DUP of CASSANDRA-16201 > Massive GC suspension increase after updating to 3.0.14 from 2.1.18 > --- > > Key: CASSANDRA-13900 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13900 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Core >Reporter: Thomas Steinmaurer >Priority: Urgent > Attachments: cassandra2118_vs_3014.jpg, cassandra3014_jfr_5min.jpg, > cassandra_3.11.0_min_memory_utilization.jpg > > > In short: After upgrading to 3.0.14 (from 2.1.18), we aren't able to process > the same incoming write load on the same infrastructure anymore. > We have a loadtest environment running 24x7 testing our software using > Cassandra as backend. Both, loadtest and production is hosted in AWS and do > have the same spec on the Cassandra-side, namely: > * 9x m4.xlarge > * 8G heap > * CMS (400MB newgen) > * 2TB EBS gp2 > * Client requests are entirely CQL > per node. We have a solid/constant baseline in loadtest at ~ 60% CPU cluster > AVG with constant, simulated load running against our cluster, using > Cassandra 2.1 for > 2 years now. > Recently we started to upgrade to 3.0.14 in this 9 node loadtest environment, > and basically, 3.0.14 isn't able to cope with the load anymore. No particular > special tweaks, memory settings/changes etc., all the same as in 2.1.18. We > also didn't upgrade sstables yet, thus the increase mentioned in the > screenshot is not related to any manually triggered maintenance operation > after upgrading to 3.0.14. > According to our monitoring, with 3.0.14, we see a *GC suspension time > increase by a factor of > 2*, of course directly correlating with an CPU > increase > 80%. See: attached screen "cassandra2118_vs_3014.jpg" > This all means that our incoming load against 2.1.18 is something, 3.0.14 > can't handle. So, we would need to either scale up (e.g. m4.xlarge => > m4.2xlarge) or scale out for being able to handle the same load, which is > cost-wise not an option. > Unfortunately I do not have Java Flight Recorder runs for 2.1.18 at the > mentioned load, but can provide JFR session for our current 3.0.14 setup. The > attached 5min JFR memory allocation area (cassandra3014_jfr_5min.jpg) shows > compaction being the top contributor for the captured 5min time-frame. Could > be by "accident" covering the 5min with compaction as top contributor only > (although mentioned simulated client load is attached), but according to > stack traces, we see new classes from 3.0, e.g. BTreeRow.searchIterator() > etc. popping up as top contributor, thus possibly new classes / data > structures are causing much more object churn now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-16201) Reduce amount of allocations during batch statement execution
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225957#comment-17225957 ] Thomas Steinmaurer edited comment on CASSANDRA-16201 at 11/4/20, 9:56 AM: -- [~mck], thanks a lot for the extensive follow-up. In our tests, from where the actual JFR files come from, we now see, that Cassandra 3.11 and Cassandra 4.0 is basically on the same level than 3.0 again, or even slightly better than 3.0, but 2.1 unbeaten :-) Do you see any further improvements in regard to 2.1 vs. 3.0/3.11/4.0? Following chart is an AVG for last 24hrs on a bunch of metrics, for all versions with the patch applied for 3.0/3.11/4.0, processing the same ingest. The only main difference here is, that 2.1 is using STCS for our timeseries tables, whereas 3.0+ is using TWCS. !screenshot-4.png|width=100%! So, in short: || ||Cassandra 2.1||Cassandra 3.0 Patched (Rel. diff to 2.1)||Cassandra 3.11 Patched (Rel. diff to 2.1)||Cassandra 4.0 Patched (Rel. diff to 2.1)|| |AVG CPU|52,86%|61,43% (+16,2%)|61,04% (+15,5%)|75,06% (+42%)| |AVG Suspension|3,76%|6,13% (+63%)|5,74% (+52,7%)|5,60% (+48,9%)| But for *Cassandra 3.11* and *Cassandra 4.0*, this was a huge step forward! Thanks a lot! was (Author: tsteinmaurer): [~mck], thanks a lot for the extensive follow-up. In our tests, from where the actual JFR files come from, we now see, that Cassandra 3.11 and Cassandra 4.0 is basically on the same level than 3.0 again, or even slightly better than 3.0, but 2.1 unbeaten :-) Do you see any further improvements in regard to 2.1 vs. 3.0/3.11/4.0? Following chart is an AVG for last 24hrs on a bunch of metrics, for all versions with the patch applied for 3.0/3.11/4.0, processing the same ingest. The only main difference here is, that 2.1 is using STCS for our timeseries tables, whereas 3.0+ is using TWCS. !screenshot-4.png|width=100%! So, in short: || ||Cassandra 2.1||Cassandra 3.0 Patched (Rel. diff to 2.1)||Cassandra 3.11 Patched (Rel. diff to 2.1)||Cassandra 4.0 Patched (Rel. diff to 2.1)|| |AVG CPU|52,86%|61,43% (+16,2%)|61,04% (+15,5%)|75,06% (+42%)| |AVG Suspension|3,76%|6,13% (+63%)|5,74% (+52,7%)|5,60% (+48,9%)| > Reduce amount of allocations during batch statement execution > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Thomas Steinmaurer >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-beta > > Attachments: 16201_jfr_3023_alloc.png, 16201_jfr_3023_obj.png, > 16201_jfr_3118_alloc.png, 16201_jfr_3118_obj.png, 16201_jfr_40b3_alloc.png, > 16201_jfr_40b3_obj.png, screenshot-1.png, screenshot-2.png, screenshot-3.png, > screenshot-4.png > > > In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, we have > multiple NTR threads in a 3-digit MB range. > This is likely related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. > !screenshot-1.png|width=100%! > So it seems we have many, many 20K elemnts pre-allocated object arrays > resulting in a shallow heap of 80K each, although there is only one element > in the array. > This sort of pre-allocation is causing a lot of memory pressure. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16201) Reduce amount of allocations during batch statement execution
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17225957#comment-17225957 ] Thomas Steinmaurer commented on CASSANDRA-16201: [~mck], thanks a lot for the extensive follow-up. In our tests, from where the actual JFR files come from, we now see, that Cassandra 3.11 and Cassandra 4.0 is basically on the same level than 3.0 again, or even slightly better than 3.0, but 2.1 unbeaten :-) Do you see any further improvements in regard to 2.1 vs. 3.0/3.11/4.0? Following chart is an AVG for last 24hrs on a bunch of metrics, for all versions with the patch applied for 3.0/3.11/4.0, processing the same ingest. The only main difference here is, that 2.1 is using STCS for our timeseries tables, whereas 3.0+ is using TWCS. !screenshot-4.png|width=100%! So, in short: || ||Cassandra 2.1||Cassandra 3.0 Patched (Rel. diff to 2.1)||Cassandra 3.11 Patched (Rel. diff to 2.1)||Cassandra 4.0 Patched (Rel. diff to 2.1)|| |AVG CPU|52,86%|61,43% (+16,2%)|61,04% (+15,5%)|75,06% (+42%)| |AVG Suspension|3,76%|6,13% (+63%)|5,74% (+52,7%)|5,60% (+48,9%)| > Reduce amount of allocations during batch statement execution > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Thomas Steinmaurer >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-beta > > Attachments: 16201_jfr_3023_alloc.png, 16201_jfr_3023_obj.png, > 16201_jfr_3118_alloc.png, 16201_jfr_3118_obj.png, 16201_jfr_40b3_alloc.png, > 16201_jfr_40b3_obj.png, screenshot-1.png, screenshot-2.png, screenshot-3.png, > screenshot-4.png > > > In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, we have > multiple NTR threads in a 3-digit MB range. > This is likely related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. > !screenshot-1.png|width=100%! > So it seems we have many, many 20K elemnts pre-allocated object arrays > resulting in a shallow heap of 80K each, although there is only one element > in the array. > This sort of pre-allocation is causing a lot of memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16201) Reduce amount of allocations during batch statement execution
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-16201: --- Attachment: screenshot-4.png > Reduce amount of allocations during batch statement execution > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Thomas Steinmaurer >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-beta > > Attachments: 16201_jfr_3023_alloc.png, 16201_jfr_3023_obj.png, > 16201_jfr_3118_alloc.png, 16201_jfr_3118_obj.png, 16201_jfr_40b3_alloc.png, > 16201_jfr_40b3_obj.png, screenshot-1.png, screenshot-2.png, screenshot-3.png, > screenshot-4.png > > > In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, we have > multiple NTR threads in a 3-digit MB range. > This is likely related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. > !screenshot-1.png|width=100%! > So it seems we have many, many 20K elemnts pre-allocated object arrays > resulting in a shallow heap of 80K each, although there is only one element > in the array. > This sort of pre-allocation is causing a lot of memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16201) Reduce amount of allocations during batch statement execution
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17223585#comment-17223585 ] Thomas Steinmaurer commented on CASSANDRA-16201: {quote} I will keep that running over the night and provide another set of JFR files for 3.0, 3.11 and 4.0 with the patch. {quote} [~mck], in my provided OneDrive share provided on Oct 12, 2020 to you, there is now an additional sub-directory called {{_perffixes_jfr_20201027}}, which contains a new set of JFR files for all versions (including 2.1), with the patch applied to 3.0, 3.11 and 4.0. [~marcuse], let me know if/how I could share the new JFR files with you as well. > Reduce amount of allocations during batch statement execution > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Thomas Steinmaurer >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-beta > > Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png > > > In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, we have > multiple NTR threads in a 3-digit MB range. > This is likely related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. > !screenshot-1.png|width=100%! > So it seems we have many, many 20K elemnts pre-allocated object arrays > resulting in a shallow heap of 80K each, although there is only one element > in the array. > This sort of pre-allocation is causing a lot of memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16201) Reduce amount of allocations during batch statement execution
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218301#comment-17218301 ] Thomas Steinmaurer commented on CASSANDRA-16201: [~marcuse], first impression from our comparison infrastructure regarding the 3.0, 3.11 and 4.0 patches. When having a look on 2 high-level metrics: * JVM suspension, marked as "1" in the dashboard below * Cassandra dropped messages, marked as "2" in the dashboard below !screenshot-3.png|width=100%! * Cassandra 3.0: No positive impact on suspension * Cassandra 3.11: Huge positive impact on suspension * Cassandra 4.0: Huge positive impact on suspension + no dropped messages with the patch I will keep that running over the night and provide another set of JFR files for 3.0, 3.11 and 4.0 with the patch. Thanks for your efforts! > Reduce amount of allocations during batch statement execution > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Thomas Steinmaurer >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-beta > > Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png > > > In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, we have > multiple NTR threads in a 3-digit MB range. > This is likely related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. > !screenshot-1.png|width=100%! > So it seems we have many, many 20K elemnts pre-allocated object arrays > resulting in a shallow heap of 80K each, although there is only one element > in the array. > This sort of pre-allocation is causing a lot of memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16201) Reduce amount of allocations during batch statement execution
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-16201: --- Attachment: screenshot-3.png > Reduce amount of allocations during batch statement execution > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Thomas Steinmaurer >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0-beta > > Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png > > > In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, we have > multiple NTR threads in a 3-digit MB range. > This is likely related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. > !screenshot-1.png|width=100%! > So it seems we have many, many 20K elemnts pre-allocated object arrays > resulting in a shallow heap of 80K each, although there is only one element > in the array. > This sort of pre-allocation is causing a lot of memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
[ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213000#comment-17213000 ] Thomas Steinmaurer commented on CASSANDRA-15430: [~mck], thanks! In Cassandra 3.0. I see in {{BTree}}. {noformat} public static Builder builder(Comparator comparator, int initialCapacity) { return new Builder<>(comparator); } {noformat} that this is basically missing forwarding the provided {{initialCapacity}} in the {{new Builder ...}} call. Not doing that potentially creates the used {{Object[]}} at a too small size resulting in many resizing operations during the life-time of the {{Object[]}}, correct? Once propagating {{initialCapacity}} (added in 3.11+, thus the backport to 3.0.x), we then start to hit CASSANDRA-16201, so I understand we need both for 3.0.x. What I don't understand yet (or perhaps not looked closely enough) is, how {{MultiCBuilder.build()}} could benefit from that, cause it won't call {{BTreeSet.builder}} with any sort of {{initialCapacity}} information, thus falling back to the default {{Object[]}} size of 16. Thanks again. > Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations > compared to 2.1.18 > > > Key: CASSANDRA-15430 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15430 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Thomas Steinmaurer >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0.x > > Attachments: dashboard.png, jfr_allocations.png, jfr_jmc_2-1.png, > jfr_jmc_2-1_obj.png, jfr_jmc_3-0.png, jfr_jmc_3-0_obj.png, > jfr_jmc_3-0_obj_obj_alloc.png, jfr_jmc_3-11.png, jfr_jmc_3-11_obj.png, > jfr_jmc_4-0-b2.png, jfr_jmc_4-0-b2_obj.png, mutation_stage.png, > screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png > > > In a 6 node loadtest cluster, we have been running with 2.1.18 a certain > production-like workload constantly and sufficiently. After upgrading one > node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of > regression described below), 3.0.18 is showing increased CPU usage, increase > GC, high mutation stage pending tasks, dropped mutation messages ... > Some spec. All 6 nodes equally sized: > * Bare metal, 32 physical cores, 512G RAM > * Xmx31G, G1, max pause millis = 2000ms > * cassandra.yaml basically unchanged, thus same settings in regard to number > of threads, compaction throttling etc. > Following dashboard shows highlighted areas (CPU, suspension) with metrics > for all 6 nodes and the one outlier being the node upgraded to Cassandra > 3.0.18. > !dashboard.png|width=1280! > Additionally we see a large increase on pending tasks in the mutation stage > after the upgrade: > !mutation_stage.png! 
> And dropped mutation messages, also confirmed in the Cassandra log: > {noformat} > INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - > MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout > and 0 for cross node timeout > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool > NameActive Pending Completed Blocked All Time > Blocked > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > MutationStage 256 81824 3360532756 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ViewMutationStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ReadStage 0 0 62862266 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > RequestResponseStage 0 0 2176659856 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > ReadRepairStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > CounterMutationStage 0 0 0 0 > 0 > ... > {noformat} > Judging from a 15min JFR session for both, 3.0.18 and 2.1.18 on a different > node, high-level, it looks like the code path underneath > {{BatchMessage.execute}} is producing ~ 10x more on-heap allocations in > 3.0.18 compared to 2.1.18. > !jfr_allocations.png! > Left => 3.0.18 > Right => 2.1.18 > JFRs zipped are exceeding the 60MB limit to directly attach to the ticket. I > can upload them, if there is another destination available. -- This message was sent
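For illustration, a hedged sketch of the {{initialCapacity}} forwarding the comment above asks about. This is a simplified stand-in, not the real {{org.apache.cassandra.utils.btree.BTree}} class:

{noformat}
import java.util.Comparator;

final class BTreeBuilderSketch
{
    static final class Builder<V>
    {
        final Comparator<? super V> comparator;
        Object[] values;

        Builder(Comparator<? super V> comparator, int initialCapacity)
        {
            this.comparator = comparator;
            // 16 is the default array size mentioned in the comment above.
            this.values = new Object[initialCapacity > 0 ? initialCapacity : 16];
        }
    }

    // The 3.0 code quoted above effectively does: return new Builder<>(comparator);
    // i.e. the provided initialCapacity is dropped and the backing array has to
    // grow repeatedly. Forwarding it (as in 3.11+) sizes the array once up front.
    static <V> Builder<V> builder(Comparator<? super V> comparator, int initialCapacity)
    {
        return new Builder<>(comparator, initialCapacity);
    }
}
{noformat}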
[jira] [Commented] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
[ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212876#comment-17212876 ] Thomas Steinmaurer commented on CASSANDRA-15430: [~mck], while I somehow understand the relation to my recently reported CASSANDRA-16201, can you please help me to understand why CASSANDRA-13929 seems to be related? Thanks. > Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations > compared to 2.1.18 > > > Key: CASSANDRA-15430 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15430 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Thomas Steinmaurer >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.0.x > > Attachments: dashboard.png, jfr_allocations.png, jfr_jmc_2-1.png, > jfr_jmc_2-1_obj.png, jfr_jmc_3-0.png, jfr_jmc_3-0_obj.png, > jfr_jmc_3-0_obj_obj_alloc.png, jfr_jmc_3-11.png, jfr_jmc_3-11_obj.png, > jfr_jmc_4-0-b2.png, jfr_jmc_4-0-b2_obj.png, mutation_stage.png, > screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png > > > In a 6 node loadtest cluster, we have been running with 2.1.18 a certain > production-like workload constantly and sufficiently. After upgrading one > node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of > regression described below), 3.0.18 is showing increased CPU usage, increase > GC, high mutation stage pending tasks, dropped mutation messages ... > Some spec. All 6 nodes equally sized: > * Bare metal, 32 physical cores, 512G RAM > * Xmx31G, G1, max pause millis = 2000ms > * cassandra.yaml basically unchanged, thus same settings in regard to number > of threads, compaction throttling etc. > Following dashboard shows highlighted areas (CPU, suspension) with metrics > for all 6 nodes and the one outlier being the node upgraded to Cassandra > 3.0.18. > !dashboard.png|width=1280! > Additionally we see a large increase on pending tasks in the mutation stage > after the upgrade: > !mutation_stage.png! > And dropped mutation messages, also confirmed in the Cassandra log: > {noformat} > INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - > MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout > and 0 for cross node timeout > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool > NameActive Pending Completed Blocked All Time > Blocked > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > MutationStage 256 81824 3360532756 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ViewMutationStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ReadStage 0 0 62862266 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > RequestResponseStage 0 0 2176659856 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > ReadRepairStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > CounterMutationStage 0 0 0 0 > 0 > ... > {noformat} > Judging from a 15min JFR session for both, 3.0.18 and 2.1.18 on a different > node, high-level, it looks like the code path underneath > {{BatchMessage.execute}} is producing ~ 10x more on-heap allocations in > 3.0.18 compared to 2.1.18. > !jfr_allocations.png! > Left => 3.0.18 > Right => 2.1.18 > JFRs zipped are exceeding the 60MB limit to directly attach to the ticket. I > can upload them, if there is another destination available. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-16201) Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations in BatchUpdatesCollector
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210238#comment-17210238 ] Thomas Steinmaurer edited comment on CASSANDRA-16201 at 10/13/20, 5:06 AM: --- [~marcuse], yes I think so. :-) TRUNK, locally checked out, calling hierarchy from {{BatchUpdatesCollector.getPartitionUpdateBuilder}} up to {{PartitionUpdate.Builder.rowBuilder}} !screenshot-2.png|width=100%! Thanks again. was (Author: tsteinmaurer): [~marcuse], yes I think so. :-) TRUNK, locally checked out, calling hierarchy from {{BatchUpdatesCollector.getPartitionUpdateBuilder}} up to {{PartitionUpdate.Builder.rowBuilder}} !screenshot-2.png! Thanks again. > Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations > in BatchUpdatesCollector > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Thomas Steinmaurer >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 3.11.x, 4.0-beta > > Attachments: screenshot-1.png, screenshot-2.png > > > In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, we have > multiple NTR threads in a 3-digit MB range. > This is likely related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. > !screenshot-1.png|width=100%! > So it seems we have many, many 20K elemnts pre-allocated object arrays > resulting in a shallow heap of 80K each, although there is only one element > in the array. > This sort of pre-allocation is causing a lot of memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16201) Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations in BatchUpdatesCollector
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-16201: --- Description: In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, we see 4.0b2 going OOM from time to time. According to a heap dump, we have multiple NTR threads in a 3-digit MB range. This is likely related to object array pre-allocations at the size of {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always only 1 {{BTreeRow}} in the {{BTree}}. !screenshot-1.png|width=100%! So it seems we have many, many 20K elemnts pre-allocated object arrays resulting in a shallow heap of 80K each, although there is only one element in the array. This sort of pre-allocation is causing a lot of memory pressure. was: In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, we see 4.0b2 going OOM from time to time. According to a heap dump, we have multiple NTR threads in a 3-digit MB range. This is likely related to object array pre-allocations at the size of {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always only 1 {{BTreeRow}} in the {{BTree}}. !screenshot-1.png! So it seems we have many, many 20K elemnts pre-allocated object arrays resulting in a shallow heap of 80K each, although there is only one element in the array. This sort of pre-allocation is causing a lot of memory pressure. > Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations > in BatchUpdatesCollector > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Thomas Steinmaurer >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 3.11.x, 4.0-beta > > Attachments: screenshot-1.png, screenshot-2.png > > > In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, we have > multiple NTR threads in a 3-digit MB range. > This is likely related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. > !screenshot-1.png|width=100%! > So it seems we have many, many 20K elemnts pre-allocated object arrays > resulting in a shallow heap of 80K each, although there is only one element > in the array. > This sort of pre-allocation is causing a lot of memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
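For illustration, a minimal sketch of the allocation pattern this ticket describes. It is not the real {{BatchUpdatesCollector}} code; it only contrasts sizing per-partition row arrays by the whole batch's {{updatedRows}} with sizing them by the rows each partition actually receives:

{noformat}
final class BatchAllocationSketch
{
    // What the ticket describes: every per-partition buffer is sized by the
    // total number of updated rows in the whole batch (e.g. ~20,000), even
    // though each partition only ever holds a single row -> ~80K of shallow
    // heap per array, almost entirely empty.
    static Object[][] oversizedBuffers(int partitions, int totalUpdatedRows)
    {
        Object[][] buffers = new Object[partitions][];
        for (int i = 0; i < partitions; i++)
            buffers[i] = new Object[totalUpdatedRows];
        return buffers;
    }

    // The contrast: size each buffer by the rows that partition actually gets.
    static Object[][] rightSizedBuffers(int[] rowsPerPartition)
    {
        Object[][] buffers = new Object[rowsPerPartition.length][];
        for (int i = 0; i < rowsPerPartition.length; i++)
            buffers[i] = new Object[Math.max(rowsPerPartition[i], 1)];
        return buffers;
    }
}
{noformat}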
[jira] [Commented] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
[ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212408#comment-17212408 ] Thomas Steinmaurer commented on CASSANDRA-15430: Sent [~mck] a fresh set of JFR files today from our recent 2.1.18 / 3.0.20 / 3.11.8 / 4.0 Beta2 testing. > Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations > compared to 2.1.18 > > > Key: CASSANDRA-15430 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15430 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: dashboard.png, jfr_allocations.png, mutation_stage.png, > screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png > > > In a 6 node loadtest cluster, we have been running with 2.1.18 a certain > production-like workload constantly and sufficiently. After upgrading one > node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of > regression described below), 3.0.18 is showing increased CPU usage, increase > GC, high mutation stage pending tasks, dropped mutation messages ... > Some spec. All 6 nodes equally sized: > * Bare metal, 32 physical cores, 512G RAM > * Xmx31G, G1, max pause millis = 2000ms > * cassandra.yaml basically unchanged, thus same settings in regard to number > of threads, compaction throttling etc. > Following dashboard shows highlighted areas (CPU, suspension) with metrics > for all 6 nodes and the one outlier being the node upgraded to Cassandra > 3.0.18. > !dashboard.png|width=1280! > Additionally we see a large increase on pending tasks in the mutation stage > after the upgrade: > !mutation_stage.png! > And dropped mutation messages, also confirmed in the Cassandra log: > {noformat} > INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - > MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout > and 0 for cross node timeout > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool > NameActive Pending Completed Blocked All Time > Blocked > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > MutationStage 256 81824 3360532756 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ViewMutationStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ReadStage 0 0 62862266 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > RequestResponseStage 0 0 2176659856 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > ReadRepairStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > CounterMutationStage 0 0 0 0 > 0 > ... > {noformat} > Judging from a 15min JFR session for both, 3.0.18 and 2.1.18 on a different > node, high-level, it looks like the code path underneath > {{BatchMessage.execute}} is producing ~ 10x more on-heap allocations in > 3.0.18 compared to 2.1.18. > !jfr_allocations.png! > Left => 3.0.18 > Right => 2.1.18 > JFRs zipped are exceeding the 60MB limit to directly attach to the ticket. I > can upload them, if there is another destination available. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
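For reference, 15-minute recordings like the ones exchanged here can be captured with JFR either via the {{-XX:StartFlightRecording}} startup flag (the usual route on the JDK 8 nodes in this test) or programmatically on JDK 11+ through the {{jdk.jfr}} API. A minimal sketch of the programmatic variant, with an assumed output file name:
{noformat}
import java.nio.file.Path;
import java.time.Duration;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

public class FifteenMinuteRecording
{
    public static void main(String[] args) throws Exception
    {
        // The built-in "profile" configuration enables the allocation events
        // that back the per-method allocation views compared in this ticket.
        Configuration profile = Configuration.getConfiguration("profile");
        Recording recording = new Recording(profile);
        recording.setDestination(Path.of("batch-allocations.jfr")); // written when the recording stops
        recording.start();
        Thread.sleep(Duration.ofMinutes(15).toMillis());            // stand-in for the load-test window
        recording.stop();
    }
}
{noformat}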
[jira] [Commented] (CASSANDRA-16201) Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations in BatchUpdatesCollector
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210249#comment-17210249 ] Thomas Steinmaurer commented on CASSANDRA-16201: [~dcapwell], code screen above is from local TRUNK, thus not strictly Beta2. [~marcuse] already contacted me via Slack. Thanks for your attention > Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations > in BatchUpdatesCollector > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Thomas Steinmaurer >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 4.0-beta > > Attachments: screenshot-1.png, screenshot-2.png > > > In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, we have > multiple NTR threads in a 3-digit MB range. > This is likely related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. > !screenshot-1.png! > So it seems we have many, many 20K elemnts pre-allocated object arrays > resulting in a shallow heap of 80K each, although there is only one element > in the array. > This sort of pre-allocation is causing a lot of memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-16201) Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations in BatchUpdatesCollector
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210238#comment-17210238 ] Thomas Steinmaurer edited comment on CASSANDRA-16201 at 10/8/20, 2:14 PM: -- [~marcuse], yes I think so. :-) TRUNK, locally checked out, calling hierarchy from {{BatchUpdatesCollector.getPartitionUpdateBuilder}} up to {{PartitionUpdate.Builder.rowBuilder}} !screenshot-2.png! Thanks again. was (Author: tsteinmaurer): [~marcuse], yes I think so. :-) Locally checked out, calling hierarchy from {{BatchUpdatesCollector.getPartitionUpdateBuilder}} up to {{PartitionUpdate.Builder.rowBuilder}} !screenshot-2.png! Thanks again. > Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations > in BatchUpdatesCollector > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: screenshot-1.png, screenshot-2.png > > > In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, we have > multiple NTR threads in a 3-digit MB range. > This is likely related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. > !screenshot-1.png! > So it seems we have many, many 20K elemnts pre-allocated object arrays > resulting in a shallow heap of 80K each, although there is only one element > in the array. > This sort of pre-allocation is causing a lot of memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16201) Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations in BatchUpdatesCollector
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210238#comment-17210238 ] Thomas Steinmaurer commented on CASSANDRA-16201: [~marcuse], yes I think so. :-) Locally checked out, calling hierarchy from {{BatchUpdatesCollector.getPartitionUpdateBuilder}} up to {{PartitionUpdate.Builder.rowBuilder}} !screenshot-2.png! Thanks again. > Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations > in BatchUpdatesCollector > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: screenshot-1.png, screenshot-2.png > > > In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, we have > multiple NTR threads in a 3-digit MB range. > This is likely related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. > !screenshot-1.png! > So it seems we have many, many 20K elemnts pre-allocated object arrays > resulting in a shallow heap of 80K each, although there is only one element > in the array. > This sort of pre-allocation is causing a lot of memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
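To make the pattern under discussion concrete (a hypothetical sketch only, with made-up names that do not mirror Cassandra's {{BTree}} builder code): sizing the backing array from a batch-wide upper bound wastes nearly the whole allocation whenever a partition update ends up containing a single row, whereas growing lazily from a small capacity keeps that common case cheap.
{noformat}
import java.util.Arrays;

// Hypothetical illustration of upper-bound pre-sizing vs. lazy growth; not Cassandra code.
final class RowArraySketch
{
    private Object[] rows;
    private int size;

    // Pre-sized variant: one 20,000-slot array per builder if the caller passes
    // the batch-wide row count, even when only one row is ever added.
    RowArraySketch(int expectedRows)
    {
        this.rows = new Object[expectedRows];
    }

    // Lazy variant: start tiny and double on demand.
    static RowArraySketch lazilySized()
    {
        return new RowArraySketch(1);
    }

    void add(Object row)
    {
        if (size == rows.length)
            rows = Arrays.copyOf(rows, Math.max(1, rows.length * 2));
        rows[size++] = row;
    }
}
{noformat}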
[jira] [Updated] (CASSANDRA-16201) Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations in BatchUpdatesCollector
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-16201: --- Attachment: screenshot-2.png > Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations > in BatchUpdatesCollector > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug > Components: Local/Other >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: screenshot-1.png, screenshot-2.png > > > In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, we have > multiple NTR threads in a 3-digit MB range. > This is likely related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. > !screenshot-1.png! > So it seems we have many, many 20K elemnts pre-allocated object arrays > resulting in a shallow heap of 80K each, although there is only one element > in the array. > This sort of pre-allocation is causing a lot of memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16201) Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations in BatchUpdatesCollector
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-16201: --- Description: In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, we see 4.0b2 going OOM from time to time. According to a heap dump, we have multiple NTR threads in a 3-digit MB range. This is likely related to object array pre-allocations at the size of {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always only 1 {{BTreeRow}} in the {{BTree}}. !screenshot-1.png! So it seems we have many, many 20K elemnts pre-allocated object arrays resulting in a shallow heap of 80K each, although there is only one element in the array. This sort of pre-allocation is causing a lot of memory pressure. was: In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, we see 4.0b2 going OOM from time to time. According to a heap dump, we have multiple NTR threads in a 3-digit MB range. This is likely related to object array pre-allocations at the size of {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always only 1 {{BTreeRow}} in the {{BTree}}. !screenshot-1.png! So it seems we have many, many 20K elemnts pre-allocated object arrays resulting in a shallow heap of 80K each, although there is only one element in the array. > Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations > in BatchUpdatesCollector > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: screenshot-1.png > > > In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, we have > multiple NTR threads in a 3-digit MB range. > This is likely related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. > !screenshot-1.png! > So it seems we have many, many 20K elemnts pre-allocated object arrays > resulting in a shallow heap of 80K each, although there is only one element > in the array. > This sort of pre-allocation is causing a lot of memory pressure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16201) Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations in BatchUpdatesCollector
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-16201: --- Description: In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, we see 4.0b2 going OOM from time to time. According to a heap dump, we have multiple NTR threads in a 3-digit MB range. This is likely related to object array pre-allocations at the size of {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always only 1 {{BTreeRow}} in the {{BTree}}. !screenshot-1.png! So it seems we have many, many 20K elemnts pre-allocated object arrays resulting in a shallow heap of 80K each, although there is only one element in the array. was: In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, we see 4.0b2 going OOM from time to time. According to a heap dump, likely related to object array pre-allocations at the size of {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always only 1 {{BTreeRow}} in the {{BTree}}. > Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations > in BatchUpdatesCollector > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: screenshot-1.png > > > In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, we have > multiple NTR threads in a 3-digit MB range. > This is likely related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. > !screenshot-1.png! > So it seems we have many, many 20K elemnts pre-allocated object arrays > resulting in a shallow heap of 80K each, although there is only one element > in the array. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16201) Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations in BatchUpdatesCollector
[ https://issues.apache.org/jira/browse/CASSANDRA-16201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-16201: --- Attachment: screenshot-1.png > Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations > in BatchUpdatesCollector > - > > Key: CASSANDRA-16201 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: screenshot-1.png > > > In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, > we see 4.0b2 going OOM from time to time. According to a heap dump, likely > related to object array pre-allocations at the size of > {{BatchUpdatesCollector.updatedRows}} per {{BTree}} although there is always > only 1 {{BTreeRow}} in the {{BTree}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-16201) Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations in BatchUpdatesCollector
Thomas Steinmaurer created CASSANDRA-16201: -- Summary: Cassandra 4.0 b2 - OOM / memory pressure due to object array pre-allocations in BatchUpdatesCollector Key: CASSANDRA-16201 URL: https://issues.apache.org/jira/browse/CASSANDRA-16201 Project: Cassandra Issue Type: Bug Reporter: Thomas Steinmaurer In a Cas 2.1 / 3.0 / 3.11 / 4.0b2 comparison test with the same load profile, we see 4.0b2 going OOM from time to time. According to a heap dump, this is likely related to object array pre-allocations at the size of {{BatchUpdatesCollector.updatedRows}} per {{BTree}}, although there is always only 1 {{BTreeRow}} in the {{BTree}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16153) Cassandra 4b2 - JVM options from *.options not read/set
[ https://issues.apache.org/jira/browse/CASSANDRA-16153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208986#comment-17208986 ] Thomas Steinmaurer commented on CASSANDRA-16153: [~brandon.williams], sorry for wasting your time. I have discovered that this is an issue on our side on how we start Cassandra. Feel free to close. > Cassandra 4b2 - JVM options from *.options not read/set > --- > > Key: CASSANDRA-16153 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16153 > Project: Cassandra > Issue Type: Bug > Components: Local/Scripts >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: debug.log.2020-10-01.0.zip, system.log.2020-10-01.0.zip > > > Trying out Cassandra 4 beta 2 with Java 8 (AdoptOpenJDK) in AWS. > {noformat} > NAME="Amazon Linux AMI" > VERSION="2018.03" > ID="amzn" > ID_LIKE="rhel fedora" > VERSION_ID="2018.03" > PRETTY_NAME="Amazon Linux AMI 2018.03" > ANSI_COLOR="0;33" > CPE_NAME="cpe:/o:amazon:linux:2018.03:ga" > HOME_URL="http://aws.amazon.com/amazon-linux-ami/"; > {noformat} > It seems the Cassandra JVM results in using Parallel GC. > {noformat} > INFO [Service Thread] 2020-10-01 00:00:56,233 GCInspector.java:299 - PS > Scavenge GC in 541ms. PS Old Gen: 5152844776 -> 5726724752; > WARN [Service Thread] 2020-10-01 00:00:56,234 GCInspector.java:297 - PS > MarkSweep GC in 1969ms. PS Eden Space: 2111307776 -> 0; PS Old Gen: > 5726724752 -> 2581334376; PS Survivor Space: 363850224 -> 0 > {noformat} > Although {{jvm8-server.options}} is using CMS. > {noformat} > # > # GC SETTINGS # > # > ### CMS Settings > -XX:+UseParNewGC > -XX:+UseConcMarkSweepGC > -XX:+CMSParallelRemarkEnabled > -XX:SurvivorRatio=8 > -XX:MaxTenuringThreshold=1 > -XX:CMSInitiatingOccupancyFraction=75 > -XX:+UseCMSInitiatingOccupancyOnly > -XX:CMSWaitDuration=1 > -XX:+CMSParallelInitialMarkEnabled > -XX:+CMSEdenChunksRecordAlways > ## some JVMs will fill up their heap when accessed via JMX, see CASSANDRA-6541 > -XX:+CMSClassUnloadingEnabled > ... > {noformat} > In Cassandra 3, default has been CMS. > So, possibly there is something wrong in reading/processing > {{jvm8-server.options}}? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
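Since the symptom in this ticket was {{GCInspector}} logging {{PS Scavenge}} / {{PS MarkSweep}} (Parallel GC) although CMS was configured, one quick way to confirm which collector a JVM actually came up with is the standard management API. A small self-contained sketch, independent of how Cassandra is started:
{noformat}
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class WhichGc
{
    public static void main(String[] args)
    {
        // Parallel GC reports "PS Scavenge" / "PS MarkSweep";
        // CMS reports "ParNew" / "ConcurrentMarkSweep" -- the same names GCInspector logs are built from.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
            System.out.println(gc.getName());
    }
}
{noformat}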
[jira] [Updated] (CASSANDRA-16153) Cassandra 4b2 - JVM options from *.options not read/set
[ https://issues.apache.org/jira/browse/CASSANDRA-16153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-16153: --- Attachment: debug.log.2020-10-01.0.zip > Cassandra 4b2 - JVM options from *.options not read/set > --- > > Key: CASSANDRA-16153 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16153 > Project: Cassandra > Issue Type: Bug > Components: Local/Scripts >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: debug.log.2020-10-01.0.zip, system.log.2020-10-01.0.zip > > > Trying out Cassandra 4 beta 2 with Java 8 (AdoptOpenJDK) in AWS. > {noformat} > NAME="Amazon Linux AMI" > VERSION="2018.03" > ID="amzn" > ID_LIKE="rhel fedora" > VERSION_ID="2018.03" > PRETTY_NAME="Amazon Linux AMI 2018.03" > ANSI_COLOR="0;33" > CPE_NAME="cpe:/o:amazon:linux:2018.03:ga" > HOME_URL="http://aws.amazon.com/amazon-linux-ami/"; > {noformat} > It seems the Cassandra JVM results in using Parallel GC. > {noformat} > INFO [Service Thread] 2020-10-01 00:00:56,233 GCInspector.java:299 - PS > Scavenge GC in 541ms. PS Old Gen: 5152844776 -> 5726724752; > WARN [Service Thread] 2020-10-01 00:00:56,234 GCInspector.java:297 - PS > MarkSweep GC in 1969ms. PS Eden Space: 2111307776 -> 0; PS Old Gen: > 5726724752 -> 2581334376; PS Survivor Space: 363850224 -> 0 > {noformat} > Although {{jvm8-server.options}} is using CMS. > {noformat} > # > # GC SETTINGS # > # > ### CMS Settings > -XX:+UseParNewGC > -XX:+UseConcMarkSweepGC > -XX:+CMSParallelRemarkEnabled > -XX:SurvivorRatio=8 > -XX:MaxTenuringThreshold=1 > -XX:CMSInitiatingOccupancyFraction=75 > -XX:+UseCMSInitiatingOccupancyOnly > -XX:CMSWaitDuration=1 > -XX:+CMSParallelInitialMarkEnabled > -XX:+CMSEdenChunksRecordAlways > ## some JVMs will fill up their heap when accessed via JMX, see CASSANDRA-6541 > -XX:+CMSClassUnloadingEnabled > ... > {noformat} > In Cassandra 3, default has been CMS. > So, possibly there is something wrong in reading/processing > {{jvm8-server.options}}? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16153) Cassandra 4b2 - JVM options from *.options not read/set
[ https://issues.apache.org/jira/browse/CASSANDRA-16153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207828#comment-17207828 ] Thomas Steinmaurer commented on CASSANDRA-16153: Sure. Attached [^debug.log.2020-10-01.0.zip] > Cassandra 4b2 - JVM options from *.options not read/set > --- > > Key: CASSANDRA-16153 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16153 > Project: Cassandra > Issue Type: Bug > Components: Local/Scripts >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: debug.log.2020-10-01.0.zip, system.log.2020-10-01.0.zip > > > Trying out Cassandra 4 beta 2 with Java 8 (AdoptOpenJDK) in AWS. > {noformat} > NAME="Amazon Linux AMI" > VERSION="2018.03" > ID="amzn" > ID_LIKE="rhel fedora" > VERSION_ID="2018.03" > PRETTY_NAME="Amazon Linux AMI 2018.03" > ANSI_COLOR="0;33" > CPE_NAME="cpe:/o:amazon:linux:2018.03:ga" > HOME_URL="http://aws.amazon.com/amazon-linux-ami/"; > {noformat} > It seems the Cassandra JVM results in using Parallel GC. > {noformat} > INFO [Service Thread] 2020-10-01 00:00:56,233 GCInspector.java:299 - PS > Scavenge GC in 541ms. PS Old Gen: 5152844776 -> 5726724752; > WARN [Service Thread] 2020-10-01 00:00:56,234 GCInspector.java:297 - PS > MarkSweep GC in 1969ms. PS Eden Space: 2111307776 -> 0; PS Old Gen: > 5726724752 -> 2581334376; PS Survivor Space: 363850224 -> 0 > {noformat} > Although {{jvm8-server.options}} is using CMS. > {noformat} > # > # GC SETTINGS # > # > ### CMS Settings > -XX:+UseParNewGC > -XX:+UseConcMarkSweepGC > -XX:+CMSParallelRemarkEnabled > -XX:SurvivorRatio=8 > -XX:MaxTenuringThreshold=1 > -XX:CMSInitiatingOccupancyFraction=75 > -XX:+UseCMSInitiatingOccupancyOnly > -XX:CMSWaitDuration=1 > -XX:+CMSParallelInitialMarkEnabled > -XX:+CMSEdenChunksRecordAlways > ## some JVMs will fill up their heap when accessed via JMX, see CASSANDRA-6541 > -XX:+CMSClassUnloadingEnabled > ... > {noformat} > In Cassandra 3, default has been CMS. > So, possibly there is something wrong in reading/processing > {{jvm8-server.options}}? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16153) Cassandra 4b2 - JVM options from *.options not read/set
[ https://issues.apache.org/jira/browse/CASSANDRA-16153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205973#comment-17205973 ] Thomas Steinmaurer commented on CASSANDRA-16153: [^system.log.2020-10-01.0.zip] Search for: {noformat} ... INFO [main] 2020-10-01 06:17:53,135 CassandraDaemon.java:507 - Hostname: ip-X-Y-68-230:7000:7001 ... {noformat} > Cassandra 4b2 - JVM options from *.options not read/set > --- > > Key: CASSANDRA-16153 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16153 > Project: Cassandra > Issue Type: Bug > Components: Local/Scripts >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: system.log.2020-10-01.0.zip > > > Trying out Cassandra 4 beta 2 with Java 8 (AdoptOpenJDK) in AWS. > {noformat} > NAME="Amazon Linux AMI" > VERSION="2018.03" > ID="amzn" > ID_LIKE="rhel fedora" > VERSION_ID="2018.03" > PRETTY_NAME="Amazon Linux AMI 2018.03" > ANSI_COLOR="0;33" > CPE_NAME="cpe:/o:amazon:linux:2018.03:ga" > HOME_URL="http://aws.amazon.com/amazon-linux-ami/"; > {noformat} > It seems the Cassandra JVM results in using Parallel GC. > {noformat} > INFO [Service Thread] 2020-10-01 00:00:56,233 GCInspector.java:299 - PS > Scavenge GC in 541ms. PS Old Gen: 5152844776 -> 5726724752; > WARN [Service Thread] 2020-10-01 00:00:56,234 GCInspector.java:297 - PS > MarkSweep GC in 1969ms. PS Eden Space: 2111307776 -> 0; PS Old Gen: > 5726724752 -> 2581334376; PS Survivor Space: 363850224 -> 0 > {noformat} > Although {{jvm8-server.options}} is using CMS. > {noformat} > # > # GC SETTINGS # > # > ### CMS Settings > -XX:+UseParNewGC > -XX:+UseConcMarkSweepGC > -XX:+CMSParallelRemarkEnabled > -XX:SurvivorRatio=8 > -XX:MaxTenuringThreshold=1 > -XX:CMSInitiatingOccupancyFraction=75 > -XX:+UseCMSInitiatingOccupancyOnly > -XX:CMSWaitDuration=1 > -XX:+CMSParallelInitialMarkEnabled > -XX:+CMSEdenChunksRecordAlways > ## some JVMs will fill up their heap when accessed via JMX, see CASSANDRA-6541 > -XX:+CMSClassUnloadingEnabled > ... > {noformat} > In Cassandra 3, default has been CMS. > So, possibly there is something wrong in reading/processing > {{jvm8-server.options}}? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16153) Cassandra 4b2 - JVM options from *.options not read/set
[ https://issues.apache.org/jira/browse/CASSANDRA-16153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-16153: --- Attachment: system.log.2020-10-01.0.zip > Cassandra 4b2 - JVM options from *.options not read/set > --- > > Key: CASSANDRA-16153 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16153 > Project: Cassandra > Issue Type: Bug > Components: Local/Scripts >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: system.log.2020-10-01.0.zip > > > Trying out Cassandra 4 beta 2 with Java 8 (AdoptOpenJDK) in AWS. > {noformat} > NAME="Amazon Linux AMI" > VERSION="2018.03" > ID="amzn" > ID_LIKE="rhel fedora" > VERSION_ID="2018.03" > PRETTY_NAME="Amazon Linux AMI 2018.03" > ANSI_COLOR="0;33" > CPE_NAME="cpe:/o:amazon:linux:2018.03:ga" > HOME_URL="http://aws.amazon.com/amazon-linux-ami/"; > {noformat} > It seems the Cassandra JVM results in using Parallel GC. > {noformat} > INFO [Service Thread] 2020-10-01 00:00:56,233 GCInspector.java:299 - PS > Scavenge GC in 541ms. PS Old Gen: 5152844776 -> 5726724752; > WARN [Service Thread] 2020-10-01 00:00:56,234 GCInspector.java:297 - PS > MarkSweep GC in 1969ms. PS Eden Space: 2111307776 -> 0; PS Old Gen: > 5726724752 -> 2581334376; PS Survivor Space: 363850224 -> 0 > {noformat} > Although {{jvm8-server.options}} is using CMS. > {noformat} > # > # GC SETTINGS # > # > ### CMS Settings > -XX:+UseParNewGC > -XX:+UseConcMarkSweepGC > -XX:+CMSParallelRemarkEnabled > -XX:SurvivorRatio=8 > -XX:MaxTenuringThreshold=1 > -XX:CMSInitiatingOccupancyFraction=75 > -XX:+UseCMSInitiatingOccupancyOnly > -XX:CMSWaitDuration=1 > -XX:+CMSParallelInitialMarkEnabled > -XX:+CMSEdenChunksRecordAlways > ## some JVMs will fill up their heap when accessed via JMX, see CASSANDRA-6541 > -XX:+CMSClassUnloadingEnabled > ... > {noformat} > In Cassandra 3, default has been CMS. > So, possibly there is something wrong in reading/processing > {{jvm8-server.options}}? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-16153) Cassandra 4b2 - JVM options from *.options not read/set
[ https://issues.apache.org/jira/browse/CASSANDRA-16153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205967#comment-17205967 ] Thomas Steinmaurer commented on CASSANDRA-16153: [~brandon.williams], no. 4 vCores (m5.xlarge). > Cassandra 4b2 - JVM options from *.options not read/set > --- > > Key: CASSANDRA-16153 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16153 > Project: Cassandra > Issue Type: Bug > Components: Local/Scripts >Reporter: Thomas Steinmaurer >Priority: Normal > > Trying out Cassandra 4 beta 2 with Java 8 (AdoptOpenJDK) in AWS. > {noformat} > NAME="Amazon Linux AMI" > VERSION="2018.03" > ID="amzn" > ID_LIKE="rhel fedora" > VERSION_ID="2018.03" > PRETTY_NAME="Amazon Linux AMI 2018.03" > ANSI_COLOR="0;33" > CPE_NAME="cpe:/o:amazon:linux:2018.03:ga" > HOME_URL="http://aws.amazon.com/amazon-linux-ami/"; > {noformat} > It seems the Cassandra JVM results in using Parallel GC. > {noformat} > INFO [Service Thread] 2020-10-01 00:00:56,233 GCInspector.java:299 - PS > Scavenge GC in 541ms. PS Old Gen: 5152844776 -> 5726724752; > WARN [Service Thread] 2020-10-01 00:00:56,234 GCInspector.java:297 - PS > MarkSweep GC in 1969ms. PS Eden Space: 2111307776 -> 0; PS Old Gen: > 5726724752 -> 2581334376; PS Survivor Space: 363850224 -> 0 > {noformat} > Although {{jvm8-server.options}} is using CMS. > {noformat} > # > # GC SETTINGS # > # > ### CMS Settings > -XX:+UseParNewGC > -XX:+UseConcMarkSweepGC > -XX:+CMSParallelRemarkEnabled > -XX:SurvivorRatio=8 > -XX:MaxTenuringThreshold=1 > -XX:CMSInitiatingOccupancyFraction=75 > -XX:+UseCMSInitiatingOccupancyOnly > -XX:CMSWaitDuration=1 > -XX:+CMSParallelInitialMarkEnabled > -XX:+CMSEdenChunksRecordAlways > ## some JVMs will fill up their heap when accessed via JMX, see CASSANDRA-6541 > -XX:+CMSClassUnloadingEnabled > ... > {noformat} > In Cassandra 3, default has been CMS. > So, possibly there is something wrong in reading/processing > {{jvm8-server.options}}? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-16153) Cassandra 4b2 - JVM options from *.options not read/set
[ https://issues.apache.org/jira/browse/CASSANDRA-16153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-16153: --- Description: Trying out Cassandra 4 beta 2 with Java 8 (AdoptOpenJDK) in AWS. {noformat} NAME="Amazon Linux AMI" VERSION="2018.03" ID="amzn" ID_LIKE="rhel fedora" VERSION_ID="2018.03" PRETTY_NAME="Amazon Linux AMI 2018.03" ANSI_COLOR="0;33" CPE_NAME="cpe:/o:amazon:linux:2018.03:ga" HOME_URL="http://aws.amazon.com/amazon-linux-ami/"; {noformat} It seems the Cassandra JVM results in using Parallel GC. {noformat} INFO [Service Thread] 2020-10-01 00:00:56,233 GCInspector.java:299 - PS Scavenge GC in 541ms. PS Old Gen: 5152844776 -> 5726724752; WARN [Service Thread] 2020-10-01 00:00:56,234 GCInspector.java:297 - PS MarkSweep GC in 1969ms. PS Eden Space: 2111307776 -> 0; PS Old Gen: 5726724752 -> 2581334376; PS Survivor Space: 363850224 -> 0 {noformat} Although {{jvm8-server.options}} is using CMS. {noformat} # # GC SETTINGS # # ### CMS Settings -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSWaitDuration=1 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways ## some JVMs will fill up their heap when accessed via JMX, see CASSANDRA-6541 -XX:+CMSClassUnloadingEnabled ... {noformat} In Cassandra 3, default has been CMS. So, possibly there is something wrong in reading/processing {{jvm8-server.options}}? was: Trying out Cassandra 4 beta 2 with Java 8 (AdoptOpenJDK) on Ubuntu 18.04 LTS. It seems the Cassandra JVM results in using Parallel GC. In Cassandra 3, default has been CMS. Digging a bit further, it seems like the {{jvm8-server.options}} resp. {{jvm11-server.options}} files aren't used/processed in e.g. {{cassandra-env.sh}}. E.g. in Cassandra 3.11, here we something like that in {{cassandra-env.sh}}. {noformat} # Read user-defined JVM options from jvm.options file JVM_OPTS_FILE=$CASSANDRA_CONF/jvm.options for opt in `grep "^-" $JVM_OPTS_FILE` do JVM_OPTS="$JVM_OPTS $opt" done {noformat} Can't find something similar in {{cassandra-env.sh}} for Cassandra 4 beta2. > Cassandra 4b2 - JVM options from *.options not read/set > --- > > Key: CASSANDRA-16153 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16153 > Project: Cassandra > Issue Type: Bug > Components: Local/Scripts >Reporter: Thomas Steinmaurer >Priority: Normal > > Trying out Cassandra 4 beta 2 with Java 8 (AdoptOpenJDK) in AWS. > {noformat} > NAME="Amazon Linux AMI" > VERSION="2018.03" > ID="amzn" > ID_LIKE="rhel fedora" > VERSION_ID="2018.03" > PRETTY_NAME="Amazon Linux AMI 2018.03" > ANSI_COLOR="0;33" > CPE_NAME="cpe:/o:amazon:linux:2018.03:ga" > HOME_URL="http://aws.amazon.com/amazon-linux-ami/"; > {noformat} > It seems the Cassandra JVM results in using Parallel GC. > {noformat} > INFO [Service Thread] 2020-10-01 00:00:56,233 GCInspector.java:299 - PS > Scavenge GC in 541ms. PS Old Gen: 5152844776 -> 5726724752; > WARN [Service Thread] 2020-10-01 00:00:56,234 GCInspector.java:297 - PS > MarkSweep GC in 1969ms. PS Eden Space: 2111307776 -> 0; PS Old Gen: > 5726724752 -> 2581334376; PS Survivor Space: 363850224 -> 0 > {noformat} > Although {{jvm8-server.options}} is using CMS. 
> {noformat} > # > # GC SETTINGS # > # > ### CMS Settings > -XX:+UseParNewGC > -XX:+UseConcMarkSweepGC > -XX:+CMSParallelRemarkEnabled > -XX:SurvivorRatio=8 > -XX:MaxTenuringThreshold=1 > -XX:CMSInitiatingOccupancyFraction=75 > -XX:+UseCMSInitiatingOccupancyOnly > -XX:CMSWaitDuration=1 > -XX:+CMSParallelInitialMarkEnabled > -XX:+CMSEdenChunksRecordAlways > ## some JVMs will fill up their heap when accessed via JMX, see CASSANDRA-6541 > -XX:+CMSClassUnloadingEnabled > ... > {noformat} > In Cassandra 3, default has been CMS. > So, possibly there is something wrong in reading/processing > {{jvm8-server.options}}? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
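A complementary check when an options file seems to be ignored: listing the arguments the JVM actually received makes it obvious whether the CMS flags from {{jvm8-server.options}} ever reached the command line. A minimal sketch using the standard management API:
{noformat}
import java.lang.management.ManagementFactory;

public class EffectiveJvmArgs
{
    public static void main(String[] args)
    {
        // Every -XX/-D argument passed at startup shows up here; if the CMS flags
        // from jvm8-server.options are missing, the options file was never applied.
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments())
            System.out.println(arg);
    }
}
{noformat}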
[jira] [Created] (CASSANDRA-16153) Cassandra 4b2 - JVM options from *.options not read/set
Thomas Steinmaurer created CASSANDRA-16153: -- Summary: Cassandra 4b2 - JVM options from *.options not read/set Key: CASSANDRA-16153 URL: https://issues.apache.org/jira/browse/CASSANDRA-16153 Project: Cassandra Issue Type: Bug Components: Local/Scripts Reporter: Thomas Steinmaurer Trying out Cassandra 4 beta 2 with Java 8 (AdoptOpenJDK) on Ubuntu 18.04 LTS. It seems the Cassandra JVM results in using Parallel GC. In Cassandra 3, default has been CMS. Digging a bit further, it seems like the {{jvm8-server.options}} resp. {{jvm11-server.options}} files aren't used/processed in e.g. {{cassandra-env.sh}}. E.g. in Cassandra 3.11, there is something like this in {{cassandra-env.sh}}: {noformat} # Read user-defined JVM options from jvm.options file JVM_OPTS_FILE=$CASSANDRA_CONF/jvm.options for opt in `grep "^-" $JVM_OPTS_FILE` do JVM_OPTS="$JVM_OPTS $opt" done {noformat} Can't find something similar in {{cassandra-env.sh}} for Cassandra 4 beta2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
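The 3.11 snippet quoted above effectively keeps every line that starts with a dash. The same selection rule, written out as a small stand-alone Java sketch for clarity (hypothetical, not part of Cassandra's startup scripts):
{noformat}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class OptionsFileSketch
{
    // Mirrors the shell loop: keep only lines beginning with "-", ignoring comments and blanks.
    static List<String> readJvmOptions(Path optionsFile) throws IOException
    {
        return Files.readAllLines(optionsFile).stream()
                    .map(String::trim)
                    .filter(line -> line.startsWith("-"))
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException
    {
        readJvmOptions(Paths.get(args[0])).forEach(System.out::println);
    }
}
{noformat}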
[jira] [Updated] (CASSANDRA-15563) Backport removal of OpenJDK warning log
[ https://issues.apache.org/jira/browse/CASSANDRA-15563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-15563: --- Summary: Backport removal of OpenJDK warning log (was: Backport OpenJDK warning log) > Backport removal of OpenJDK warning log > --- > > Key: CASSANDRA-15563 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15563 > Project: Cassandra > Issue Type: Task >Reporter: Thomas Steinmaurer >Priority: Normal > > As requested on Slack, creating this ticket for a backport of > CASSANDRA-13916, potentially to 2.2 and 3.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-15563) Backport OpenJDK warning log
Thomas Steinmaurer created CASSANDRA-15563: -- Summary: Backport OpenJDK warning log Key: CASSANDRA-15563 URL: https://issues.apache.org/jira/browse/CASSANDRA-15563 Project: Cassandra Issue Type: Task Reporter: Thomas Steinmaurer As requested on Slack, creating this ticket for a backport of CASSANDRA-13916, potentially to 2.2 and 3.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
[ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022712#comment-17022712 ] Thomas Steinmaurer edited comment on CASSANDRA-15430 at 1/24/20 6:07 AM: - [~benedict], in the previously provided OneDrive Link, I have put another JFR (sub-directory {{full_on_cas_3.0.18}}) where the entire cluster was on 3.0.18, thus any {{LegacyLayout}} related signs in the stack traces should be gone. I see no reason to open another ticket for that, cause it does not change the situation, that the write path churns a lot (compared to 2.1). Thanks! was (Author: tsteinmaurer): [~benedict], in the previously provided OneDrive Link, I have put another JFR where the entire cluster was on 3.0.18, thus any {{LegacyLayout}} related signs in the stack traces should be gone. I see no reason to open another ticket for that, cause it does not change the situation, that the write path churns a lot (compared to 2.1). Thanks! > Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations > compared to 2.1.18 > > > Key: CASSANDRA-15430 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15430 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: dashboard.png, jfr_allocations.png, mutation_stage.png, > screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png > > > In a 6 node loadtest cluster, we have been running with 2.1.18 a certain > production-like workload constantly and sufficiently. After upgrading one > node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of > regression described below), 3.0.18 is showing increased CPU usage, increase > GC, high mutation stage pending tasks, dropped mutation messages ... > Some spec. All 6 nodes equally sized: > * Bare metal, 32 physical cores, 512G RAM > * Xmx31G, G1, max pause millis = 2000ms > * cassandra.yaml basically unchanged, thus same settings in regard to number > of threads, compaction throttling etc. > Following dashboard shows highlighted areas (CPU, suspension) with metrics > for all 6 nodes and the one outlier being the node upgraded to Cassandra > 3.0.18. > !dashboard.png|width=1280! > Additionally we see a large increase on pending tasks in the mutation stage > after the upgrade: > !mutation_stage.png! > And dropped mutation messages, also confirmed in the Cassandra log: > {noformat} > INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - > MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout > and 0 for cross node timeout > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool > NameActive Pending Completed Blocked All Time > Blocked > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > MutationStage 256 81824 3360532756 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ViewMutationStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ReadStage 0 0 62862266 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > RequestResponseStage 0 0 2176659856 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > ReadRepairStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > CounterMutationStage 0 0 0 0 > 0 > ... 
> {noformat} > Judging from a 15min JFR session for both, 3.0.18 and 2.1.18 on a different > node, high-level, it looks like the code path underneath > {{BatchMessage.execute}} is producing ~ 10x more on-heap allocations in > 3.0.18 compared to 2.1.18. > !jfr_allocations.png! > Left => 3.0.18 > Right => 2.1.18 > JFRs zipped are exceeding the 60MB limit to directly attach to the ticket. I > can upload them, if there is another destination available. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
[ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022712#comment-17022712 ] Thomas Steinmaurer commented on CASSANDRA-15430: [~benedict], in the previously provided OneDrive Link, I have put another JFR where the entire cluster was on 3.0.18, thus any {{LegacyLayout}} related signs in the stack traces should be gone. I see no reason to open another ticket for that, cause it does not change the situation, that the write path churns a lot (compared to 2.1). Thanks! > Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations > compared to 2.1.18 > > > Key: CASSANDRA-15430 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15430 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: dashboard.png, jfr_allocations.png, mutation_stage.png, > screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png > > > In a 6 node loadtest cluster, we have been running with 2.1.18 a certain > production-like workload constantly and sufficiently. After upgrading one > node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of > regression described below), 3.0.18 is showing increased CPU usage, increase > GC, high mutation stage pending tasks, dropped mutation messages ... > Some spec. All 6 nodes equally sized: > * Bare metal, 32 physical cores, 512G RAM > * Xmx31G, G1, max pause millis = 2000ms > * cassandra.yaml basically unchanged, thus same settings in regard to number > of threads, compaction throttling etc. > Following dashboard shows highlighted areas (CPU, suspension) with metrics > for all 6 nodes and the one outlier being the node upgraded to Cassandra > 3.0.18. > !dashboard.png|width=1280! > Additionally we see a large increase on pending tasks in the mutation stage > after the upgrade: > !mutation_stage.png! > And dropped mutation messages, also confirmed in the Cassandra log: > {noformat} > INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - > MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout > and 0 for cross node timeout > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool > NameActive Pending Completed Blocked All Time > Blocked > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > MutationStage 256 81824 3360532756 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ViewMutationStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ReadStage 0 0 62862266 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > RequestResponseStage 0 0 2176659856 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > ReadRepairStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > CounterMutationStage 0 0 0 0 > 0 > ... > {noformat} > Judging from a 15min JFR session for both, 3.0.18 and 2.1.18 on a different > node, high-level, it looks like the code path underneath > {{BatchMessage.execute}} is producing ~ 10x more on-heap allocations in > 3.0.18 compared to 2.1.18. > !jfr_allocations.png! > Left => 3.0.18 > Right => 2.1.18 > JFRs zipped are exceeding the 60MB limit to directly attach to the ticket. I > can upload them, if there is another destination available. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
[ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017990#comment-17017990 ] Thomas Steinmaurer edited comment on CASSANDRA-15430 at 1/17/20 1:09 PM: - [~benedict], thanks! Just another internal high-level overview on our 3.0 experience compared to 2.1 when it comes to the write path. !screenshot-4.png! "Timeseries (TS)" written is simply our "payload" (data point) we are ingesting in a 6 node load test environment. While 2.1.18 was able to handle ~ 2 mio TS payloads / min / Cassandra JVM at 3-4% GC suspension and 25% CPU usage on a 64 vCore box (32 physical cores) without dropping mutation messages, 3.0.18 looks much worse. All 3.0.18 based tests in the table have been done without being in a mixed Cassandra binary version scenario. So, any low-hanging fruit would be much appreciated. :-) was (Author: tsteinmaurer): [~benedict], thanks! Just another internal high-level overview on our 3.0 experience compared to 2.1 when it comes to the write patch. !screenshot-4.png! "Timeseries (TS)" written is simply our "payload" (data point) we are ingesting in a 6 node load test environment. While 2.1.18 was able to handle ~ 2 mio TS payloads / min / Cassandra JVM at 3-4% GC suspension and 25% CPU usage on a 64 vCore box (32 physical cores) without dropping mutation messages, 3.0.18 looks much worse. All 3.0.18 based tests in the table have been done without being in a mixed Cassandra binary version scenario. So, any low-hanging fruit would be much appreciated. :-) > Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations > compared to 2.1.18 > > > Key: CASSANDRA-15430 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15430 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: dashboard.png, jfr_allocations.png, mutation_stage.png, > screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png > > > In a 6 node loadtest cluster, we have been running with 2.1.18 a certain > production-like workload constantly and sufficiently. After upgrading one > node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of > regression described below), 3.0.18 is showing increased CPU usage, increase > GC, high mutation stage pending tasks, dropped mutation messages ... > Some spec. All 6 nodes equally sized: > * Bare metal, 32 physical cores, 512G RAM > * Xmx31G, G1, max pause millis = 2000ms > * cassandra.yaml basically unchanged, thus same settings in regard to number > of threads, compaction throttling etc. > Following dashboard shows highlighted areas (CPU, suspension) with metrics > for all 6 nodes and the one outlier being the node upgraded to Cassandra > 3.0.18. > !dashboard.png|width=1280! > Additionally we see a large increase on pending tasks in the mutation stage > after the upgrade: > !mutation_stage.png! 
> And dropped mutation messages, also confirmed in the Cassandra log: > {noformat} > INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - > MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout > and 0 for cross node timeout > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool > NameActive Pending Completed Blocked All Time > Blocked > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > MutationStage 256 81824 3360532756 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ViewMutationStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ReadStage 0 0 62862266 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > RequestResponseStage 0 0 2176659856 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > ReadRepairStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > CounterMutationStage 0 0 0 0 > 0 > ... > {noformat} > Judging from a 15min JFR session for both, 3.0.18 and 2.1.18 on a different > node, high-level, it looks like the code path underneath > {{BatchMessage.execute}} is producing ~ 10x more on-heap allocations in > 3.0.18 compared to 2.1.18. > !jfr_allocations.png! > Left => 3.0.18 > Right => 2.1
[jira] [Commented] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
[ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017990#comment-17017990 ] Thomas Steinmaurer commented on CASSANDRA-15430: [~benedict], thanks! Just another internal high-level overview on our 3.0 experience compared to 2.1 when it comes to the write patch. !screenshot-4.png! "Timeseries (TS)" written is simply our "payload" (data point) we are ingesting in a 6 node load test environment. While 2.1.18 was able to handle ~ 2 mio TS payloads / min / Cassandra JVM at 3-4% GC suspension and 25% CPU usage on a 64 vCore box (32 physical cores) without dropping mutation messages, 3.0.18 looks much worse. All 3.0.18 based tests in the table have been done without being in a mixed Cassandra binary version scenario. So, any low-hanging fruit would be much appreciated. :-) > Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations > compared to 2.1.18 > > > Key: CASSANDRA-15430 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15430 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: dashboard.png, jfr_allocations.png, mutation_stage.png, > screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png > > > In a 6 node loadtest cluster, we have been running with 2.1.18 a certain > production-like workload constantly and sufficiently. After upgrading one > node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of > regression described below), 3.0.18 is showing increased CPU usage, increase > GC, high mutation stage pending tasks, dropped mutation messages ... > Some spec. All 6 nodes equally sized: > * Bare metal, 32 physical cores, 512G RAM > * Xmx31G, G1, max pause millis = 2000ms > * cassandra.yaml basically unchanged, thus same settings in regard to number > of threads, compaction throttling etc. > Following dashboard shows highlighted areas (CPU, suspension) with metrics > for all 6 nodes and the one outlier being the node upgraded to Cassandra > 3.0.18. > !dashboard.png|width=1280! > Additionally we see a large increase on pending tasks in the mutation stage > after the upgrade: > !mutation_stage.png! > And dropped mutation messages, also confirmed in the Cassandra log: > {noformat} > INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - > MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout > and 0 for cross node timeout > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool > NameActive Pending Completed Blocked All Time > Blocked > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > MutationStage 256 81824 3360532756 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ViewMutationStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ReadStage 0 0 62862266 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > RequestResponseStage 0 0 2176659856 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > ReadRepairStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > CounterMutationStage 0 0 0 0 > 0 > ... > {noformat} > Judging from a 15min JFR session for both, 3.0.18 and 2.1.18 on a different > node, high-level, it looks like the code path underneath > {{BatchMessage.execute}} is producing ~ 10x more on-heap allocations in > 3.0.18 compared to 2.1.18. 
> !jfr_allocations.png! > Left => 3.0.18 > Right => 2.1.18 > JFRs zipped are exceeding the 60MB limit to directly attach to the ticket. I > can upload them, if there is another destination available. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
[ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-15430: --- Attachment: screenshot-4.png > Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations > compared to 2.1.18 > > > Key: CASSANDRA-15430 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15430 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: dashboard.png, jfr_allocations.png, mutation_stage.png, > screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png > > > In a 6 node loadtest cluster, we have been running with 2.1.18 a certain > production-like workload constantly and sufficiently. After upgrading one > node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of > regression described below), 3.0.18 is showing increased CPU usage, increase > GC, high mutation stage pending tasks, dropped mutation messages ... > Some spec. All 6 nodes equally sized: > * Bare metal, 32 physical cores, 512G RAM > * Xmx31G, G1, max pause millis = 2000ms > * cassandra.yaml basically unchanged, thus same settings in regard to number > of threads, compaction throttling etc. > Following dashboard shows highlighted areas (CPU, suspension) with metrics > for all 6 nodes and the one outlier being the node upgraded to Cassandra > 3.0.18. > !dashboard.png|width=1280! > Additionally we see a large increase on pending tasks in the mutation stage > after the upgrade: > !mutation_stage.png! > And dropped mutation messages, also confirmed in the Cassandra log: > {noformat} > INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - > MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout > and 0 for cross node timeout > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool > NameActive Pending Completed Blocked All Time > Blocked > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > MutationStage 256 81824 3360532756 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ViewMutationStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ReadStage 0 0 62862266 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > RequestResponseStage 0 0 2176659856 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > ReadRepairStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > CounterMutationStage 0 0 0 0 > 0 > ... > {noformat} > Judging from a 15min JFR session for both, 3.0.18 and 2.1.18 on a different > node, high-level, it looks like the code path underneath > {{BatchMessage.execute}} is producing ~ 10x more on-heap allocations in > 3.0.18 compared to 2.1.18. > !jfr_allocations.png! > Left => 3.0.18 > Right => 2.1.18 > JFRs zipped are exceeding the 60MB limit to directly attach to the ticket. I > can upload them, if there is another destination available. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
[ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017955#comment-17017955 ] Thomas Steinmaurer commented on CASSANDRA-15430: [~benedict], as discussed, a few more JFR-based screens to better show the differences, until you find time to open JFR files yourself. Again, it is all about the write path (only) executing batch messages. Both had the same *JFR duration*, namely *15min*. Cassandra *2.1.18*: * BatchMessage.execute - In total: 42,76 GByte with next level top contributors: ** BatchStatement.getMutations => 19,46 GByte ** BatchStatement.executeWithoutConditions => 16,97 GByte !screenshot-1.png! Cassandra *3.0.18*: * BatchMessage.execute - In total: 451,86 GByte (factor 10 more) with next level top contributors: ** BatchStatement.executeWithoutConditions => 214,23 GByte ** BatchStatement.getMutations => 205,52 GByte For *3.0.18*, more in-depth drill-down for BatchMessage.executeWithoutConditions (214,23 GByte) !screenshot-2.png! resp.: BatchMessage.getMutations (205,52 GByte) !screenshot-3.png! A bit hard to give sufficient details with screen shots, so likely it would be simply the best option, to work directly with the provided JFR files. Thanks a lot! > Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations > compared to 2.1.18 > > > Key: CASSANDRA-15430 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15430 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: dashboard.png, jfr_allocations.png, mutation_stage.png, > screenshot-1.png, screenshot-2.png, screenshot-3.png > > > In a 6 node loadtest cluster, we have been running with 2.1.18 a certain > production-like workload constantly and sufficiently. After upgrading one > node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of > regression described below), 3.0.18 is showing increased CPU usage, increase > GC, high mutation stage pending tasks, dropped mutation messages ... > Some spec. All 6 nodes equally sized: > * Bare metal, 32 physical cores, 512G RAM > * Xmx31G, G1, max pause millis = 2000ms > * cassandra.yaml basically unchanged, thus same settings in regard to number > of threads, compaction throttling etc. > Following dashboard shows highlighted areas (CPU, suspension) with metrics > for all 6 nodes and the one outlier being the node upgraded to Cassandra > 3.0.18. > !dashboard.png|width=1280! > Additionally we see a large increase on pending tasks in the mutation stage > after the upgrade: > !mutation_stage.png! 
> And dropped mutation messages, also confirmed in the Cassandra log: > {noformat} > INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - > MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout > and 0 for cross node timeout > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool > NameActive Pending Completed Blocked All Time > Blocked > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > MutationStage 256 81824 3360532756 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ViewMutationStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ReadStage 0 0 62862266 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > RequestResponseStage 0 0 2176659856 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > ReadRepairStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > CounterMutationStage 0 0 0 0 > 0 > ... > {noformat} > Judging from a 15min JFR session for both, 3.0.18 and 2.1.18 on a different > node, high-level, it looks like the code path underneath > {{BatchMessage.execute}} is producing ~ 10x more on-heap allocations in > 3.0.18 compared to 2.1.18. > !jfr_allocations.png! > Left => 3.0.18 > Right => 2.1.18 > JFRs zipped are exceeding the 60MB limit to directly attach to the ticket. I > can upload them, if there is another destination available. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For a
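The allocation comparison above comes from two 15-minute JFR recordings, one per Cassandra version. The ticket does not show how they were captured; as a rough sketch only (the command, recording name and file path below are assumptions, and JFR availability depends on the JDK build in use), a recording of that length could be started with jcmd:
{noformat}
# Sketch, not taken from the ticket. Requires a JFR-capable JDK
# (Oracle JDK 8 with commercial features unlocked, or a later OpenJDK build).
jcmd <cassandra-pid> JFR.start name=batch_profiling settings=profile duration=15m filename=/tmp/cassandra_15min.jfr

# Verify the recording has finished before copying the file off the node.
jcmd <cassandra-pid> JFR.check
{noformat}
The resulting .jfr files can then be opened in Java Mission Control to reproduce allocation breakdowns like the ones shown in the attached screenshots.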
[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
[ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-15430: --- Attachment: screenshot-3.png > Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations > compared to 2.1.18 > > > Key: CASSANDRA-15430 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15430 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: dashboard.png, jfr_allocations.png, mutation_stage.png, > screenshot-1.png, screenshot-2.png, screenshot-3.png > > > In a 6 node loadtest cluster, we have been running with 2.1.18 a certain > production-like workload constantly and sufficiently. After upgrading one > node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of > regression described below), 3.0.18 is showing increased CPU usage, increase > GC, high mutation stage pending tasks, dropped mutation messages ... > Some spec. All 6 nodes equally sized: > * Bare metal, 32 physical cores, 512G RAM > * Xmx31G, G1, max pause millis = 2000ms > * cassandra.yaml basically unchanged, thus same settings in regard to number > of threads, compaction throttling etc. > Following dashboard shows highlighted areas (CPU, suspension) with metrics > for all 6 nodes and the one outlier being the node upgraded to Cassandra > 3.0.18. > !dashboard.png|width=1280! > Additionally we see a large increase on pending tasks in the mutation stage > after the upgrade: > !mutation_stage.png! > And dropped mutation messages, also confirmed in the Cassandra log: > {noformat} > INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - > MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout > and 0 for cross node timeout > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool > NameActive Pending Completed Blocked All Time > Blocked > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > MutationStage 256 81824 3360532756 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ViewMutationStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ReadStage 0 0 62862266 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > RequestResponseStage 0 0 2176659856 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > ReadRepairStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > CounterMutationStage 0 0 0 0 > 0 > ... > {noformat} > Judging from a 15min JFR session for both, 3.0.18 and 2.1.18 on a different > node, high-level, it looks like the code path underneath > {{BatchMessage.execute}} is producing ~ 10x more on-heap allocations in > 3.0.18 compared to 2.1.18. > !jfr_allocations.png! > Left => 3.0.18 > Right => 2.1.18 > JFRs zipped are exceeding the 60MB limit to directly attach to the ticket. I > can upload them, if there is another destination available. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
[ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-15430: --- Attachment: screenshot-2.png > Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations > compared to 2.1.18 > > > Key: CASSANDRA-15430 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15430 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: dashboard.png, jfr_allocations.png, mutation_stage.png, > screenshot-1.png, screenshot-2.png > > > In a 6 node loadtest cluster, we have been running with 2.1.18 a certain > production-like workload constantly and sufficiently. After upgrading one > node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of > regression described below), 3.0.18 is showing increased CPU usage, increase > GC, high mutation stage pending tasks, dropped mutation messages ... > Some spec. All 6 nodes equally sized: > * Bare metal, 32 physical cores, 512G RAM > * Xmx31G, G1, max pause millis = 2000ms > * cassandra.yaml basically unchanged, thus same settings in regard to number > of threads, compaction throttling etc. > Following dashboard shows highlighted areas (CPU, suspension) with metrics > for all 6 nodes and the one outlier being the node upgraded to Cassandra > 3.0.18. > !dashboard.png|width=1280! > Additionally we see a large increase on pending tasks in the mutation stage > after the upgrade: > !mutation_stage.png! > And dropped mutation messages, also confirmed in the Cassandra log: > {noformat} > INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - > MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout > and 0 for cross node timeout > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool > NameActive Pending Completed Blocked All Time > Blocked > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > MutationStage 256 81824 3360532756 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ViewMutationStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ReadStage 0 0 62862266 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > RequestResponseStage 0 0 2176659856 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > ReadRepairStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > CounterMutationStage 0 0 0 0 > 0 > ... > {noformat} > Judging from a 15min JFR session for both, 3.0.18 and 2.1.18 on a different > node, high-level, it looks like the code path underneath > {{BatchMessage.execute}} is producing ~ 10x more on-heap allocations in > 3.0.18 compared to 2.1.18. > !jfr_allocations.png! > Left => 3.0.18 > Right => 2.1.18 > JFRs zipped are exceeding the 60MB limit to directly attach to the ticket. I > can upload them, if there is another destination available. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
[ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-15430: --- Attachment: screenshot-1.png > Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations > compared to 2.1.18 > > > Key: CASSANDRA-15430 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15430 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: dashboard.png, jfr_allocations.png, mutation_stage.png, > screenshot-1.png > > > In a 6 node loadtest cluster, we have been running with 2.1.18 a certain > production-like workload constantly and sufficiently. After upgrading one > node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of > regression described below), 3.0.18 is showing increased CPU usage, increase > GC, high mutation stage pending tasks, dropped mutation messages ... > Some spec. All 6 nodes equally sized: > * Bare metal, 32 physical cores, 512G RAM > * Xmx31G, G1, max pause millis = 2000ms > * cassandra.yaml basically unchanged, thus same settings in regard to number > of threads, compaction throttling etc. > Following dashboard shows highlighted areas (CPU, suspension) with metrics > for all 6 nodes and the one outlier being the node upgraded to Cassandra > 3.0.18. > !dashboard.png|width=1280! > Additionally we see a large increase on pending tasks in the mutation stage > after the upgrade: > !mutation_stage.png! > And dropped mutation messages, also confirmed in the Cassandra log: > {noformat} > INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - > MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout > and 0 for cross node timeout > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool > NameActive Pending Completed Blocked All Time > Blocked > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > MutationStage 256 81824 3360532756 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ViewMutationStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ReadStage 0 0 62862266 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > RequestResponseStage 0 0 2176659856 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > ReadRepairStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > CounterMutationStage 0 0 0 0 > 0 > ... > {noformat} > Judging from a 15min JFR session for both, 3.0.18 and 2.1.18 on a different > node, high-level, it looks like the code path underneath > {{BatchMessage.execute}} is producing ~ 10x more on-heap allocations in > 3.0.18 compared to 2.1.18. > !jfr_allocations.png! > Left => 3.0.18 > Right => 2.1.18 > JFRs zipped are exceeding the 60MB limit to directly attach to the ticket. I > can upload them, if there is another destination available. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
[ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016628#comment-17016628 ] Thomas Steinmaurer edited comment on CASSANDRA-15430 at 1/16/20 7:57 AM: - [~benedict], please try to download the JFR files for both 2.1.18 and 3.0.18 here: [https://dynatrace-my.sharepoint.com/:f:/p/thomas_steinmaurer/EoFkdBH-WnlOmuGZ4hL_8PwByBTQLwhtlBGBLW_0y3P9rg?e=uKlr6W] The data model is pretty straightforward originating from Astyanax/Thrift legacy days, moving over to CQL, in a BLOB-centric model, with our client-side "serializer framework". E.g.: {noformat} CREATE TABLE ks."cf" ( k blob, n blob, v blob, PRIMARY KEY (k, n) ) WITH COMPACT STORAGE AND CLUSTERING ORDER BY (n ASC) AND bloom_filter_fp_chance = 0.01 AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'} AND comment = '' AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '2'} AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance = 1.0 AND dclocal_read_repair_chance = 0.0 AND default_time_to_live = 0 AND gc_grace_seconds = 259200 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND read_repair_chance = 0.0 AND speculative_retry = 'NONE'; {noformat} Regarding queries. It is really just about the write path (batch message processing) in Cas 2.1 vs. 3.0 as outlined in the issue description. We have tried single-partition batches vs. multi-partition batches (I know, bad practice), but single-partition batches didn't have a positive impact on the write path in 3.0 either in our tests. Moving from 2.1 to 3.0 would mean for us to add ~ 30-40% more resources to handle the same write load sufficiently. Thanks for any help in that area! was (Author: tsteinmaurer): [~benedict], please try to download the JFR files for both 2.1.18 and 3.0.18 here: [https://dynatrace-my.sharepoint.com/:f:/p/thomas_steinmaurer/EoFkdBH-WnlOmuGZ4hL_8PwByBTQLwhtlBGBLW_0y3P9rg?e=uKlr6W] The data model is pretty straightforward originating from Astyanax/Thrift legacy days, moving over to CQL, in a BLOB-centric model, with our client-side "serializer framework". E.g.: {noformat} CREATE TABLE ks."cf" ( k blob, n blob, v blob, PRIMARY KEY (k, n) ) WITH COMPACT STORAGE AND CLUSTERING ORDER BY (n ASC) AND bloom_filter_fp_chance = 0.01 AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'} AND comment = '' AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '2'} AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance = 1.0 AND dclocal_read_repair_chance = 0.0 AND default_time_to_live = 0 AND gc_grace_seconds = 259200 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND read_repair_chance = 0.0 AND speculative_retry = 'NONE'; {noformat} Regarding queries. It is really just about the write path (batch message processing) in Cas 2.1 vs. 3.0 as outlined in the issue description. We have tried single-partition batches vs. multi-partition batches (I know, bad practice), but single-partition batches didn't have a positive impact on the write path in 3.0 either in our tests. Moving from 2.1 to 3.0 would mean for us to add ~ 30-40% more resources to handle the same load sufficiently. 
Thanks for any help in that area! > Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations > compared to 2.1.18 > > > Key: CASSANDRA-15430 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15430 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: dashboard.png, jfr_allocations.png, mutation_stage.png > > > In a 6 node loadtest cluster, we have been running with 2.1.18 a certain > production-like workload constantly and sufficiently. After upgrading one > node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of > regression described below), 3.0.18 is showing increased CPU usage, increase > GC, high mutation stage pending tasks, dropped mutation messages ... > Some spec. All 6 nodes equally sized: > * Bare metal, 32 physical cores, 512G RAM > * Xmx31G, G1, max pause millis = 2000ms > * cassandra.yaml basically unchanged, thus same settings in regard to number > of threads, compaction throttling etc. > Followi
[jira] [Commented] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
[ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016628#comment-17016628 ] Thomas Steinmaurer commented on CASSANDRA-15430: [~benedict], please try to download the JFR files for both 2.1.18 and 3.0.18 here: [https://dynatrace-my.sharepoint.com/:f:/p/thomas_steinmaurer/EoFkdBH-WnlOmuGZ4hL_8PwByBTQLwhtlBGBLW_0y3P9rg?e=uKlr6W] The data model is pretty straightforward originating from Astyanax/Thrift legacy days, moving over to CQL, in a BLOB-centric model, with our client-side "serializer framework". E.g.: {noformat} CREATE TABLE ks."cf" ( k blob, n blob, v blob, PRIMARY KEY (k, n) ) WITH COMPACT STORAGE AND CLUSTERING ORDER BY (n ASC) AND bloom_filter_fp_chance = 0.01 AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'} AND comment = '' AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '2'} AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance = 1.0 AND dclocal_read_repair_chance = 0.0 AND default_time_to_live = 0 AND gc_grace_seconds = 259200 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND read_repair_chance = 0.0 AND speculative_retry = 'NONE'; {noformat} Regarding queries. It is really just about the write path (batch message processing) in Cas 2.1 vs. 3.0 as outlined in the issue description. We have tried single-partition batches vs. multi-partition batches (I know, bad practice), but single-partition batches didn't have a positive impact on the write path in 3.0 either in our tests. Moving from 2.1 to 3.0 would mean for us to add ~ 30-40% more resources to handle the same load sufficiently. Thanks for any help in that area! > Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations > compared to 2.1.18 > > > Key: CASSANDRA-15430 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15430 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: dashboard.png, jfr_allocations.png, mutation_stage.png > > > In a 6 node loadtest cluster, we have been running with 2.1.18 a certain > production-like workload constantly and sufficiently. After upgrading one > node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of > regression described below), 3.0.18 is showing increased CPU usage, increase > GC, high mutation stage pending tasks, dropped mutation messages ... > Some spec. All 6 nodes equally sized: > * Bare metal, 32 physical cores, 512G RAM > * Xmx31G, G1, max pause millis = 2000ms > * cassandra.yaml basically unchanged, thus same settings in regard to number > of threads, compaction throttling etc. > Following dashboard shows highlighted areas (CPU, suspension) with metrics > for all 6 nodes and the one outlier being the node upgraded to Cassandra > 3.0.18. > !dashboard.png|width=1280! > Additionally we see a large increase on pending tasks in the mutation stage > after the upgrade: > !mutation_stage.png! 
> And dropped mutation messages, also confirmed in the Cassandra log: > {noformat} > INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - > MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout > and 0 for cross node timeout > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool > NameActive Pending Completed Blocked All Time > Blocked > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > MutationStage 256 81824 3360532756 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ViewMutationStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - > ReadStage 0 0 62862266 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > RequestResponseStage 0 0 2176659856 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > ReadRepairStage 0 0 0 0 > 0 > INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - > CounterMutationStage 0 0 0 0 > 0 > ... > {noformat} > Judging from a 15min JFR session for
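The distinction between single-partition and multi-partition batches drawn in the comment above determines whether the coordinator can treat the batch as a single mutation or has to fan it out to several replica sets. A minimal CQL sketch against the ks."cf" table from the comment (the blob literals are invented for illustration, and UNLOGGED is used only as an example; the ticket does not state whether the workload uses logged or unlogged batches):
{noformat}
-- Single-partition batch: every statement shares the partition key k = 0x01.
BEGIN UNLOGGED BATCH
  INSERT INTO ks."cf" (k, n, v) VALUES (0x01, 0x0a, 0xaa01);
  INSERT INTO ks."cf" (k, n, v) VALUES (0x01, 0x0b, 0xaa02);
APPLY BATCH;

-- Multi-partition batch: statements target different partition keys,
-- so the mutations are grouped and routed to different replica sets.
BEGIN UNLOGGED BATCH
  INSERT INTO ks."cf" (k, n, v) VALUES (0x01, 0x0a, 0xaa01);
  INSERT INTO ks."cf" (k, n, v) VALUES (0x02, 0x0a, 0xaa02);
APPLY BATCH;
{noformat}
Per the comment, switching to single-partition batches did not reduce the 3.0 write-path allocations in these tests.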
[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
[ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-15430: --- Description: In a 6 node loadtest cluster, we have been running with 2.1.18 a certain production-like workload constantly and sufficiently. After upgrading one node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of regression described below), 3.0.18 is showing increased CPU usage, increase GC, high mutation stage pending tasks, dropped mutation messages ... Some spec. All 6 nodes equally sized: * Bare metal, 32 physical cores, 512G RAM * Xmx31G, G1, max pause millis = 2000ms * cassandra.yaml basically unchanged, thus same settings in regard to number of threads, compaction throttling etc. Following dashboard shows highlighted areas (CPU, suspension) with metrics for all 6 nodes and the one outlier being the node upgraded to Cassandra 3.0.18. !dashboard.png|width=1280! Additionally we see a large increase on pending tasks in the mutation stage after the upgrade: !mutation_stage.png! And dropped mutation messages, also confirmed in the Cassandra log: {noformat} INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout and 0 for cross node timeout INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool NameActive Pending Completed Blocked All Time Blocked INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - MutationStage 256 81824 3360532756 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ViewMutationStage 0 0 0 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ReadStage 0 0 62862266 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - RequestResponseStage 0 0 2176659856 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - ReadRepairStage 0 0 0 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - CounterMutationStage 0 0 0 0 0 ... {noformat} Judging from a 15min JFR session for both, 3.0.18 and 2.1.18 on a different node, high-level, it looks like the code path underneath {{BatchMessage.execute}} is producing ~ 10x more on-heap allocations in 3.0.18 compared to 2.1.18. !jfr_allocations.png! Left => 3.0.18 Right => 2.1.18 JFRs zipped are exceeding the 60MB limit to directly attach to the ticket. I can upload them, if there is another destination available. was: In a 6 node loadtest cluster, we have been running with 2.1.18 a certain production-like workload constantly and sufficiently. After upgrading one node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of regression described below), 3.0.18 is showing increased CPU usage, increase GC, high mutation stage pending tasks, dropped mutation messages ... Some spec. All 6 nodes equally sized: * Bare metal, 32 physical cores, 512G RAM * Xmx31G, G1, max pause millis = 2000ms * cassandra.yaml basically unchanged, thus some settings in regard to number of threads, compaction throttling etc. Following dashboard shows highlighted areas (CPU, suspension) with metrics for all 6 nodes and the one outlier being the node upgraded to Cassandra 3.0.18. !dashboard.png|width=1280! Additionally we see a large increase on pending tasks in the mutation stage after the upgrade: !mutation_stage.png! 
And dropped mutation messages, also confirmed in the Cassandra log: {noformat} INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout and 0 for cross node timeout INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool NameActive Pending Completed Blocked All Time Blocked INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - MutationStage 256 81824 3360532756 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ViewMutationStage 0 0 0 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ReadStage 0 0 62862266 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - RequestResponseStage 0 0 2176659856
[jira] [Created] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
Thomas Steinmaurer created CASSANDRA-15430: -- Summary: Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18 Key: CASSANDRA-15430 URL: https://issues.apache.org/jira/browse/CASSANDRA-15430 Project: Cassandra Issue Type: Bug Reporter: Thomas Steinmaurer Attachments: dashboard.png, jfr_allocations.png, mutation_stage.png In a 6 node loadtest cluster, we have been running with 2.1.18 a certain production-like workload constantly and sufficiently. After upgrading one node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of regression described below), 3.0.18 is showing increased CPU usage, increase GC, high mutation stage pending tasks, dropped mutation messages ... Some spec. All 6 nodes equally sized: * Bare metal, 32 physical cores, 512G RAM * Xmx31G, G1, max pause millis = 2000ms * cassandra.yaml basically unchanged, thus some settings in regard to number of threads, compaction throttling etc. Following dashboard shows highlighted areas (CPU, suspension) with metrics for all 6 nodes and the outlier being the node upgraded to Cassandra 3.0.18. !dashboard.png|width=1280! Additionally we see a large increase on pending tasks in the mutation stage after the upgrade: !mutation_stage.png! And dropped mutation messages, also confirmed in the Cassandra log: {noformat} INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout and 0 for cross node timeout INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool NameActive Pending Completed Blocked All Time Blocked INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - MutationStage 256 81824 3360532756 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ViewMutationStage 0 0 0 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ReadStage 0 0 62862266 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - RequestResponseStage 0 0 2176659856 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - ReadRepairStage 0 0 0 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - CounterMutationStage 0 0 0 0 0 ... {noformat} Judging from a 15min JFR session for both, 3.0.18 and 2.1.18 on a different node, high-level, it looks like the code path underneath {{BatchMessage.execute}} is producing ~ 10x more on-heap allocations in 3.0.18 compared to 2.1.18. !jfr_allocations.png! Left => 3.0.18 Right => 2.1.18 JFRs zipped are exceeding the 60MB limit to directly attach to the ticket. I can upload them, if there is another destination available. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
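The node specification above lists the heap and GC settings in prose only. As a hedged mapping to JVM flags (the ticket does not show cassandra-env.sh or jvm.options, and Xms is assumed equal to Xmx here, which is the usual Cassandra recommendation), the stated values correspond to roughly:
{noformat}
# Sketch of the stated settings as JVM flags; not copied from the ticket.
-Xms31G          # assumed; the ticket only states Xmx31G
-Xmx31G
-XX:+UseG1GC
-XX:MaxGCPauseMillis=2000
{noformat}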
[jira] [Updated] (CASSANDRA-15430) Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18
[ https://issues.apache.org/jira/browse/CASSANDRA-15430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-15430: --- Description: In a 6 node loadtest cluster, we have been running with 2.1.18 a certain production-like workload constantly and sufficiently. After upgrading one node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of regression described below), 3.0.18 is showing increased CPU usage, increase GC, high mutation stage pending tasks, dropped mutation messages ... Some spec. All 6 nodes equally sized: * Bare metal, 32 physical cores, 512G RAM * Xmx31G, G1, max pause millis = 2000ms * cassandra.yaml basically unchanged, thus some settings in regard to number of threads, compaction throttling etc. Following dashboard shows highlighted areas (CPU, suspension) with metrics for all 6 nodes and the one outlier being the node upgraded to Cassandra 3.0.18. !dashboard.png|width=1280! Additionally we see a large increase on pending tasks in the mutation stage after the upgrade: !mutation_stage.png! And dropped mutation messages, also confirmed in the Cassandra log: {noformat} INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout and 0 for cross node timeout INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool NameActive Pending Completed Blocked All Time Blocked INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - MutationStage 256 81824 3360532756 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ViewMutationStage 0 0 0 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ReadStage 0 0 62862266 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - RequestResponseStage 0 0 2176659856 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - ReadRepairStage 0 0 0 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - CounterMutationStage 0 0 0 0 0 ... {noformat} Judging from a 15min JFR session for both, 3.0.18 and 2.1.18 on a different node, high-level, it looks like the code path underneath {{BatchMessage.execute}} is producing ~ 10x more on-heap allocations in 3.0.18 compared to 2.1.18. !jfr_allocations.png! Left => 3.0.18 Right => 2.1.18 JFRs zipped are exceeding the 60MB limit to directly attach to the ticket. I can upload them, if there is another destination available. was: In a 6 node loadtest cluster, we have been running with 2.1.18 a certain production-like workload constantly and sufficiently. After upgrading one node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of regression described below), 3.0.18 is showing increased CPU usage, increase GC, high mutation stage pending tasks, dropped mutation messages ... Some spec. All 6 nodes equally sized: * Bare metal, 32 physical cores, 512G RAM * Xmx31G, G1, max pause millis = 2000ms * cassandra.yaml basically unchanged, thus some settings in regard to number of threads, compaction throttling etc. Following dashboard shows highlighted areas (CPU, suspension) with metrics for all 6 nodes and the outlier being the node upgraded to Cassandra 3.0.18. !dashboard.png|width=1280! Additionally we see a large increase on pending tasks in the mutation stage after the upgrade: !mutation_stage.png! 
And dropped mutation messages, also confirmed in the Cassandra log: {noformat} INFO [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout and 0 for cross node timeout INFO [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool NameActive Pending Completed Blocked All Time Blocked INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - MutationStage 256 81824 3360532756 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ViewMutationStage 0 0 0 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ReadStage 0 0 62862266 0 0 INFO [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - RequestResponseStage 0 0 2176659856 0
[jira] [Updated] (CASSANDRA-15400) Cassandra 3.0.18 went OOM several hours after joining a cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-15400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-15400: --- Attachment: oldgen_increase_nov12.jpg > Cassandra 3.0.18 went OOM several hours after joining a cluster > --- > > Key: CASSANDRA-15400 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15400 > Project: Cassandra > Issue Type: Bug > Components: Local/SSTable >Reporter: Thomas Steinmaurer >Assignee: Blake Eggleston >Priority: Normal > Fix For: 3.0.20, 3.11.6, 4.0 > > Attachments: cassandra_hprof_bigtablereader_statsmetadata.png, > cassandra_hprof_dominator_classes.png, cassandra_hprof_statsmetadata.png, > cassandra_jvm_metrics.png, cassandra_operationcount.png, > cassandra_sstables_pending_compactions.png, image.png, > oldgen_increase_nov12.jpg > > > We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have been > facing an OOM two times with 3.0.18 on newly added nodes joining an existing > cluster after several hours being successfully bootstrapped. > Running in AWS: > * m5.2xlarge, EBS SSD (gp2) > * Xms/Xmx12G, Xmn3G, CMS GC, OpenJDK8u222 > * 4 compaction threads, throttling set to 32 MB/s > What we see is a steady increase in the OLD gen over many hours. > !cassandra_jvm_metrics.png! > * The node started to join / auto-bootstrap the cluster on Oct 30 ~ 12:00 > * It basically finished joining the cluster (UJ => UN) ~ 19hrs later on Oct > 31 ~ 07:00 also starting to be a member of serving client read requests > !cassandra_operationcount.png! > Memory-wise (on-heap) it didn't look that bad at that time, but old gen usage > constantly increased. > We see a correlation in increased number of SSTables and pending compactions. > !cassandra_sstables_pending_compactions.png! > Until we reached the OOM somewhere in Nov 1 in the night. After a Cassandra > startup (metric gap in the chart above), number of SSTables + pending > compactions is still high, but without facing memory troubles since then. > This correlation is confirmed by the auto-generated heap dump with e.g. ~ 5K > BigTableReader instances with ~ 8.7GByte retained heap in total. > !cassandra_hprof_dominator_classes.png! > Having a closer look on a single object instance, seems like each instance is > ~ 2MByte in size. > !cassandra_hprof_bigtablereader_statsmetadata.png! > With 2 pre-allocated byte buffers (highlighted in the screen above) at 1 > MByte each > We have been running with 2.1.18 for > 3 years and I can't remember dealing > with such OOM in the context of extending a cluster. > While the MAT screens above are from our production cluster, we partly can > reproduce this behavior in our loadtest environment (although not going full > OOM there), thus I might be able to share a hprof from this non-prod > environment if needed. > Thanks a lot. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
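The CASSANDRA-15400 description above likewise gives its JVM and compaction settings in prose. A hedged sketch of how those values are normally expressed (standard flag names and stock cassandra.yaml keys; the ticket itself does not show the configuration files):
{noformat}
# JVM flags as stated in the description (Xms/Xmx12G, Xmn3G, CMS GC)
-Xms12G
-Xmx12G
-Xmn3G
-XX:+UseConcMarkSweepGC

# cassandra.yaml equivalents of "4 compaction threads, throttling set to 32 MB/s"
concurrent_compactors: 4
compaction_throughput_mb_per_sec: 32
{noformat}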
[jira] [Commented] (CASSANDRA-15400) Cassandra 3.0.18 went OOM several hours after joining a cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-15400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973109#comment-16973109 ] Thomas Steinmaurer commented on CASSANDRA-15400: [~bdeggleston], thanks for the follow-up. Yes, causing quite some pain in prod in the moment, e.g. yesterday evening, close to running OOM again. !oldgen_increase_nov12.jpg! > Cassandra 3.0.18 went OOM several hours after joining a cluster > --- > > Key: CASSANDRA-15400 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15400 > Project: Cassandra > Issue Type: Bug > Components: Local/SSTable >Reporter: Thomas Steinmaurer >Assignee: Blake Eggleston >Priority: Normal > Fix For: 3.0.20, 3.11.6, 4.0 > > Attachments: cassandra_hprof_bigtablereader_statsmetadata.png, > cassandra_hprof_dominator_classes.png, cassandra_hprof_statsmetadata.png, > cassandra_jvm_metrics.png, cassandra_operationcount.png, > cassandra_sstables_pending_compactions.png, image.png, > oldgen_increase_nov12.jpg > > > We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have been > facing an OOM two times with 3.0.18 on newly added nodes joining an existing > cluster after several hours being successfully bootstrapped. > Running in AWS: > * m5.2xlarge, EBS SSD (gp2) > * Xms/Xmx12G, Xmn3G, CMS GC, OpenJDK8u222 > * 4 compaction threads, throttling set to 32 MB/s > What we see is a steady increase in the OLD gen over many hours. > !cassandra_jvm_metrics.png! > * The node started to join / auto-bootstrap the cluster on Oct 30 ~ 12:00 > * It basically finished joining the cluster (UJ => UN) ~ 19hrs later on Oct > 31 ~ 07:00 also starting to be a member of serving client read requests > !cassandra_operationcount.png! > Memory-wise (on-heap) it didn't look that bad at that time, but old gen usage > constantly increased. > We see a correlation in increased number of SSTables and pending compactions. > !cassandra_sstables_pending_compactions.png! > Until we reached the OOM somewhere in Nov 1 in the night. After a Cassandra > startup (metric gap in the chart above), number of SSTables + pending > compactions is still high, but without facing memory troubles since then. > This correlation is confirmed by the auto-generated heap dump with e.g. ~ 5K > BigTableReader instances with ~ 8.7GByte retained heap in total. > !cassandra_hprof_dominator_classes.png! > Having a closer look on a single object instance, seems like each instance is > ~ 2MByte in size. > !cassandra_hprof_bigtablereader_statsmetadata.png! > With 2 pre-allocated byte buffers (highlighted in the screen above) at 1 > MByte each > We have been running with 2.1.18 for > 3 years and I can't remember dealing > with such OOM in the context of extending a cluster. > While the MAT screens above are from our production cluster, we partly can > reproduce this behavior in our loadtest environment (although not going full > OOM there), thus I might be able to share a hprof from this non-prod > environment if needed. > Thanks a lot. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15400) Cassandra 3.0.18 went OOM several hours after joining a cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-15400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970154#comment-16970154 ] Thomas Steinmaurer commented on CASSANDRA-15400: [~bdeggleston], from ticket creation to a patch in ~ 24h. This is awesome! Many thanks. * Out of curiosity, haven't looked too deep. I guess the patch does not make the content of the byte array smaller, but the capacity of the byte array basically in-sync with that and not 1MByte in general? * Secondly, as 3.0.19 was released just recently, any ETA when a 3.0.20 public release might be available? Again, many thanks. > Cassandra 3.0.18 went OOM several hours after joining a cluster > --- > > Key: CASSANDRA-15400 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15400 > Project: Cassandra > Issue Type: Bug > Components: Local/SSTable >Reporter: Thomas Steinmaurer >Assignee: Blake Eggleston >Priority: Normal > Fix For: 3.0.20, 3.11.6, 4.0 > > Attachments: cassandra_hprof_bigtablereader_statsmetadata.png, > cassandra_hprof_dominator_classes.png, cassandra_hprof_statsmetadata.png, > cassandra_jvm_metrics.png, cassandra_operationcount.png, > cassandra_sstables_pending_compactions.png > > > We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have been > facing an OOM two times with 3.0.18 on newly added nodes joining an existing > cluster after several hours being successfully bootstrapped. > Running in AWS: > * m5.2xlarge, EBS SSD (gp2) > * Xms/Xmx12G, Xmn3G, CMS GC, OpenJDK8u222 > * 4 compaction threads, throttling set to 32 MB/s > What we see is a steady increase in the OLD gen over many hours. > !cassandra_jvm_metrics.png! > * The node started to join / auto-bootstrap the cluster on Oct 30 ~ 12:00 > * It basically finished joining the cluster (UJ => UN) ~ 19hrs later on Oct > 31 ~ 07:00 also starting to be a member of serving client read requests > !cassandra_operationcount.png! > Memory-wise (on-heap) it didn't look that bad at that time, but old gen usage > constantly increased. > We see a correlation in increased number of SSTables and pending compactions. > !cassandra_sstables_pending_compactions.png! > Until we reached the OOM somewhere in Nov 1 in the night. After a Cassandra > startup (metric gap in the chart above), number of SSTables + pending > compactions is still high, but without facing memory troubles since then. > This correlation is confirmed by the auto-generated heap dump with e.g. ~ 5K > BigTableReader instances with ~ 8.7GByte retained heap in total. > !cassandra_hprof_dominator_classes.png! > Having a closer look on a single object instance, seems like each instance is > ~ 2MByte in size. > !cassandra_hprof_bigtablereader_statsmetadata.png! > With 2 pre-allocated byte buffers (highlighted in the screen above) at 1 > MByte each > We have been running with 2.1.18 for > 3 years and I can't remember dealing > with such OOM in the context of extending a cluster. > While the MAT screens above are from our production cluster, we partly can > reproduce this behavior in our loadtest environment (although not going full > OOM there), thus I might be able to share a hprof from this non-prod > environment if needed. > Thanks a lot. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15400) Cassandra 3.0.18 went OOM several hours after joining a cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-15400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-15400: --- Description: We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have been facing an OOM two times with 3.0.18 on newly added nodes joining an existing cluster after several hours being successfully bootstrapped. Running in AWS: * m5.2xlarge, EBS SSD (gp2) * Xms/Xmx12G, Xmn3G, CMS GC, OpenJDK8u222 * 4 compaction threads, throttling set to 32 MB/s What we see is a steady increase in the OLD gen over many hours. !cassandra_jvm_metrics.png! * The node started to join / auto-bootstrap the cluster on Oct 30 ~ 12:00 * It basically finished joining the cluster (UJ => UN) ~ 19hrs later on Oct 31 ~ 07:00 also starting to be a member of serving client read requests !cassandra_operationcount.png! Memory-wise (on-heap) it didn't look that bad at that time, but old gen usage constantly increased. We see a correlation in increased number of SSTables and pending compactions. !cassandra_sstables_pending_compactions.png! Until we reached the OOM somewhere in Nov 1 in the night. After a Cassandra startup (metric gap in the chart above), number of SSTables + pending compactions is still high, but without facing memory troubles since then. This correlation is confirmed by the auto-generated heap dump with e.g. ~ 5K BigTableReader instances with ~ 8.7GByte retained heap in total. !cassandra_hprof_dominator_classes.png! Having a closer look on a single object instance, seems like each instance is ~ 2MByte in size. !cassandra_hprof_bigtablereader_statsmetadata.png! With 2 pre-allocated byte buffers (highlighted in the screen above) at 1 MByte each We have been running with 2.1.18 for > 3 years and I can't remember dealing with such OOM in the context of extending a cluster. While the MAT screens above are from our production cluster, we partly can reproduce this behavior in our loadtest environment (although not going full OOM there), thus I might be able to share a hprof from this non-prod environment if needed. Thanks a lot. was: We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have been facing an OOM two times with 3.0.18 on newly added nodes joining an existing cluster after several hours being successfully bootstrapped. Running in AWS: * m5.2xlarge, EBS SSD (gp2) * Xms/Xmx12G, Xmn3G, CMS GC * 4 compaction threads, throttling set to 32 MB/s What we see is a steady increase in the OLD gen over many hours. !cassandra_jvm_metrics.png! * The node started to join / auto-bootstrap the cluster on Oct 30 ~ 12:00 * It basically finished joining the cluster (UJ => UN) ~ 19hrs later on Oct 31 ~ 07:00 also starting to be a member of serving client read requests !cassandra_operationcount.png! Memory-wise (on-heap) it didn't look that bad at that time, but old gen usage constantly increased. We see a correlation in increased number of SSTables and pending compactions. !cassandra_sstables_pending_compactions.png! Until we reached the OOM somewhere in Nov 1 in the night. After a Cassandra startup (metric gap in the chart above), number of SSTables + pending compactions is still high, but without facing memory troubles since then. This correlation is confirmed by the auto-generated heap dump with e.g. ~ 5K BigTableReader instances with ~ 8.7GByte retained heap in total. !cassandra_hprof_dominator_classes.png! Having a closer look on a single object instance, seems like each instance is ~ 2MByte in size. 
!cassandra_hprof_bigtablereader_statsmetadata.png! With 2 pre-allocated byte buffers (highlighted in the screen above) at 1 MByte each We have been running with 2.1.18 for > 3 years and I can't remember dealing with such OOM in the context of extending a cluster. While the MAT screens above are from our production cluster, we partly can reproduce this behavior in our loadtest environment (although not going full OOM there), thus I might be able to share a hprof from this non-prod environment if needed. Thanks a lot. > Cassandra 3.0.18 went OOM several hours after joining a cluster > --- > > Key: CASSANDRA-15400 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15400 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Assignee: Blake Eggleston >Priority: Normal > Attachments: cassandra_hprof_bigtablereader_statsmetadata.png, > cassandra_hprof_dominator_classes.png, cassandra_hprof_statsmetadata.png, > cassandra_jvm_metrics.png, cassandra_operationcount.png, > cassandra_sstables_pending_compactions.png > > > We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have been > facing an OOM two times with 3.0.18 on newly added nodes joining an
[jira] [Comment Edited] (CASSANDRA-15400) Cassandra 3.0.18 went OOM several hours after joining a cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-15400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968762#comment-16968762 ] Thomas Steinmaurer edited comment on CASSANDRA-15400 at 11/6/19 10:18 PM: -- [~marcuse], the data model has evolved starting with Astyanax/Thrift moved over to pure CQL3 access (without real data migration), but still with our own application-side serializer framework, working with byte buffers, thus BLOBs on the data model side. Our high volume (usually > 1TByte per node, RF=3) CF/table looks like that, where we also see the majority of increasing number of pending compaction tasks, according to a per-CF JMX based self-monitoring: {noformat} CREATE TABLE ks.cf1 ( k blob, n blob, v blob, PRIMARY KEY (k, n) ) WITH COMPACT STORAGE ... ; {noformat} Although we tend to also have single partitions in the area of > 100MByte, e.g. visible due to according compaction logs in the Cassandra log, all not being a real problem in practice with a heap of Xms/Xmx12G resp Xmn3G and Cas 2.1. A few additional thoughts: * Likely the Cassandra node is utilizing most of the compaction threads (4 in this scenario with the m5.2xlarge instance type) with larger compactions on streamed data, giving less room for compactions of live data / actual writes while being in UJ, resulting in accessing much more smaller SSTables (looks like we have/had plenty in the area of 10-50MByte) then in UN starting to serve read requests * Is there anything known in Cas 3.0, which might result in streaming more data from other nodes compared to 2.1 resulting in increased compaction work to be done for newly joined nodes compared to 2.1 * Is there anything known in Cas 3.0, which results in more frequent memtable flushes compared to 2.1, again resulting in increased compaction work * Talking about a single {{BigTableReader}} instance again, did anything change in regard to byte buffer pre-allocation at 1MByte in {{StatsMetadata}} per data member {{minClusteringValues}} and {{maxClusteringValues}} as shown in the hprof? Looks to me we potentially waste quite some on-heap memory here !cassandra_hprof_statsmetadata.png|width=800! * Is {{StatsMetadata}} purely on-heap? Or is it somehow pulled from off-heap first resulting in the 1MByte allocation, reminding me a bit on the NIO cache buffer bug (https://support.datastax.com/hc/en-us/articles/36863663-JVM-OOM-direct-buffer-errors-affected-by-unlimited-java-nio-cache), with a recommendation setting it to exactly the number (-Djdk.nio.maxCachedBufferSize=1048576) we see in the hprof for the on-heap byte buffer Number of compaction threads, compaction throttling is unchanged during the upgrade from 2.1 to 3.0 and if memory serves me well, we should see improved compaction throughput in 3.0 with the same throttling settings anyway. was (Author: tsteinmaurer): [~marcuse], the data model has evolved starting with Astyanax/Thrift moved over to pure CQL3 access (without real data migration), but still with our own application-side serializer framework, working with byte buffers, thus BLOBs on the data model side. Our high volume (usually > 1TByte per node, RF=3) CF/table looks like that, where we also see the majority of increasing number of pending compaction tasks, according to a per-CF JMX based self-monitoring: {noformat} CREATE TABLE ks.cf1 ( k blob, n blob, v blob, PRIMARY KEY (k, n) ) WITH COMPACT STORAGE ... ; {noformat} Although we tend to also have single partitions in the area of > 100MByte, e.g. 
visible due to according compaction logs in the Cassandra log, all not being a real problem in practice with a heap of Xms/Xmx12G resp Xmn3G and Cas 2.1. A few additional thoughts: * Likely the Cassandra node is utilizing most of the compaction threads (4 in this scenario with the m5.2xlarge instance type) with larger compactions on streamed data, giving less room for compactions of live data / actual writes while being in UJ, resulting in accessing much more smaller SSTables (looks like we have/had plenty in the area of 10-50MByte) then in UN starting to serve read requests * Is there anything known in Cas 3.0, which might result in streaming more data from other nodes compared to 2.1 resulting in increased compaction work to be done for newly joined nodes compared to 2.1 * Is there anything known in Cas 3.0, which results in more frequent memtable flushes compared to 2.1, again resulting in increased compaction work * Talking about a single {{BigTableReader}} instance again, did anything change in regard to byte buffer pre-allocation at 1MByte in {{StatsMetadata}} per data member {{minClusteringValues}} and {{maxClusteringValues}} as shown in the hprof? Looks to me we potentially waste quite some on-heap memory here !cassandra_hprof_statsmetadata.png|width=800! *
[jira] [Commented] (CASSANDRA-15400) Cassandra 3.0.18 went OOM several hours after joining a cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-15400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968762#comment-16968762 ] Thomas Steinmaurer commented on CASSANDRA-15400: [~marcuse], the data model has evolved starting with Astyanax/Thrift moved over to pure CQL3 access (without real data migration), but still with our own application-side serializer framework, working with byte buffers, thus BLOBs on the data model side. Our high volume (usually > 1TByte per node, RF=3) CF/table looks like that, where we also see the majority of increasing number of pending compaction tasks, according to a per-CF JMX based self-monitoring: {noformat} CREATE TABLE ks.cf1 ( k blob, n blob, v blob, PRIMARY KEY (k, n) ) WITH COMPACT STORAGE ... ; {noformat} Although we tend to also have single partitions in the area of > 100MByte, e.g. visible due to according compaction logs in the Cassandra log, all not being a real problem in practice with a heap of Xms/Xmx12G resp Xmn3G and Cas 2.1. A few additional thoughts: * Likely the Cassandra node is utilizing most of the compaction threads (4 in this scenario with the m5.2xlarge instance type) with larger compactions on streamed data, giving less room for compactions of live data / actual writes while being in UJ, resulting in accessing much more smaller SSTables (looks like we have/had plenty in the area of 10-50MByte) then in UN starting to serve read requests * Is there anything known in Cas 3.0, which might result in streaming more data from other nodes compared to 2.1 resulting in increased compaction work to be done for newly joined nodes compared to 2.1 * Is there anything known in Cas 3.0, which results in more frequent memtable flushes compared to 2.1, again resulting in increased compaction work * Talking about a single {{BigTableReader}} instance again, did anything change in regard to byte buffer pre-allocation at 1MByte in {{StatsMetadata}} per data member {{minClusteringValues}} and {{maxClusteringValues}} as shown in the hprof? Looks to me we potentially waste quite some on-heap memory here !cassandra_hprof_statsmetadata.png|width=800! * Is {{StatsMetadata}} purely on-heap? Or is it somehow pulled from off-heap first resulting in the 1MByte allocation, reminding me a bit on the NIO cache buffer bug (https://support.datastax.com/hc/en-us/articles/36863663-JVM-OOM-direct-buffer-errors-affected-by-unlimited-java-nio-cache), with a recommendation setting it to exactly the number (-Djdk.nio.maxCachedBufferSize=1048576) we see in the hprof for the on-heap byte buffer Number of compaction threads, compaction throttling is unchanged during the upgrade from 2.1 to 3.0. > Cassandra 3.0.18 went OOM several hours after joining a cluster > --- > > Key: CASSANDRA-15400 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15400 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Assignee: Blake Eggleston >Priority: Normal > Attachments: cassandra_hprof_bigtablereader_statsmetadata.png, > cassandra_hprof_dominator_classes.png, cassandra_hprof_statsmetadata.png, > cassandra_jvm_metrics.png, cassandra_operationcount.png, > cassandra_sstables_pending_compactions.png > > > We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have been > facing an OOM two times with 3.0.18 on newly added nodes joining an existing > cluster after several hours being successfully bootstrapped. 
> Running in AWS: > * m5.2xlarge, EBS SSD (gp2) > * Xms/Xmx12G, Xmn3G, CMS GC > * 4 compaction threads, throttling set to 32 MB/s > What we see is a steady increase in the OLD gen over many hours. > !cassandra_jvm_metrics.png! > * The node started to join / auto-bootstrap the cluster on Oct 30 ~ 12:00 > * It basically finished joining the cluster (UJ => UN) ~ 19hrs later on Oct > 31 ~ 07:00 also starting to be a member of serving client read requests > !cassandra_operationcount.png! > Memory-wise (on-heap) it didn't look that bad at that time, but old gen usage > constantly increased. > We see a correlation in increased number of SSTables and pending compactions. > !cassandra_sstables_pending_compactions.png! > Until we reached the OOM somewhere in Nov 1 in the night. After a Cassandra > startup (metric gap in the chart above), number of SSTables + pending > compactions is still high, but without facing memory troubles since then. > This correlation is confirmed by the auto-generated heap dump with e.g. ~ 5K > BigTableReader instances with ~ 8.7GByte retained heap in total. > !cassandra_hprof_dominator_classes.png! > Having a closer look on a single object instance, seems like each instance is > ~ 2MByte in size. > !cassandra_hprof_bigtablereader_statsmetadat
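A rough cross-check of the figures quoted in this thread (simple arithmetic only, not taken from the heap dump itself): if each of the ~5K {{BigTableReader}} instances really carries the two pre-allocated 1 MByte buffers for {{minClusteringValues}} and {{maxClusteringValues}}, the buffers alone roughly match the retained heap reported by MAT.

{noformat}
  5,000 BigTableReader instances
x     2 buffers per instance (minClusteringValues, maxClusteringValues)
x     1 MiB per buffer
= ~10,000 MiB ~= 9.8 GiB
{noformat}

That is in the same ballpark as the ~2 MByte per instance and ~8.7 GByte total retained heap seen in the MAT screenshots, which supports the suspicion that the {{StatsMetadata}} pre-allocation dominates old gen on a node with thousands of open SSTables.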
[jira] [Updated] (CASSANDRA-15400) Cassandra 3.0.18 went OOM several hours after joining a cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-15400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-15400: --- Attachment: cassandra_hprof_statsmetadata.png > Cassandra 3.0.18 went OOM several hours after joining a cluster > --- > > Key: CASSANDRA-15400 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15400 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Assignee: Blake Eggleston >Priority: Normal > Attachments: cassandra_hprof_bigtablereader_statsmetadata.png, > cassandra_hprof_dominator_classes.png, cassandra_hprof_statsmetadata.png, > cassandra_jvm_metrics.png, cassandra_operationcount.png, > cassandra_sstables_pending_compactions.png > > > We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have been > facing an OOM two times with 3.0.18 on newly added nodes joining an existing > cluster after several hours being successfully bootstrapped. > Running in AWS: > * m5.2xlarge, EBS SSD (gp2) > * Xms/Xmx12G, Xmn3G, CMS GC > * 4 compaction threads, throttling set to 32 MB/s > What we see is a steady increase in the OLD gen over many hours. > !cassandra_jvm_metrics.png! > * The node started to join / auto-bootstrap the cluster on Oct 30 ~ 12:00 > * It basically finished joining the cluster (UJ => UN) ~ 19hrs later on Oct > 31 ~ 07:00 also starting to be a member of serving client read requests > !cassandra_operationcount.png! > Memory-wise (on-heap) it didn't look that bad at that time, but old gen usage > constantly increased. > We see a correlation in increased number of SSTables and pending compactions. > !cassandra_sstables_pending_compactions.png! > Until we reached the OOM somewhere in Nov 1 in the night. After a Cassandra > startup (metric gap in the chart above), number of SSTables + pending > compactions is still high, but without facing memory troubles since then. > This correlation is confirmed by the auto-generated heap dump with e.g. ~ 5K > BigTableReader instances with ~ 8.7GByte retained heap in total. > !cassandra_hprof_dominator_classes.png! > Having a closer look on a single object instance, seems like each instance is > ~ 2MByte in size. > !cassandra_hprof_bigtablereader_statsmetadata.png! > With 2 pre-allocated byte buffers (highlighted in the screen above) at 1 > MByte each > We have been running with 2.1.18 for > 3 years and I can't remember dealing > with such OOM in the context of extending a cluster. > While the MAT screens above are from our production cluster, we partly can > reproduce this behavior in our loadtest environment (although not going full > OOM there), thus I might be able to share a hprof from this non-prod > environment if needed. > Thanks a lot. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15400) Cassandra 3.0.18 went OOM several hours after joining a cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-15400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-15400: --- Description: We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have been facing an OOM two times with 3.0.18 on newly added nodes joining an existing cluster after several hours being successfully bootstrapped. Running in AWS: * m5.2xlarge, EBS SSD (gp2) * Xms/Xmx12G, Xmn3G, CMS GC * 4 compaction threads, throttling set to 32 MB/s What we see is a steady increase in the OLD gen over many hours. !cassandra_jvm_metrics.png! * The node started to join / auto-bootstrap the cluster on Oct 30 ~ 12:00 * It basically finished joining the cluster (UJ => UN) ~ 19hrs later on Oct 31 ~ 07:00 also starting to be a member of serving client read requests !cassandra_operationcount.png! Memory-wise (on-heap) it didn't look that bad at that time, but old gen usage constantly increased. We see a correlation in increased number of SSTables and pending compactions. !cassandra_sstables_pending_compactions.png! Until we reached the OOM somewhere in Nov 1 in the night. After a Cassandra startup (metric gap in the chart above), number of SSTables + pending compactions is still high, but without facing memory troubles since then. This correlation is confirmed by the auto-generated heap dump with e.g. ~ 5K BigTableReader instances with ~ 8.7GByte retained heap in total. !cassandra_hprof_dominator_classes.png! Having a closer look on a single object instance, seems like each instance is ~ 2MByte in size. !cassandra_hprof_bigtablereader_statsmetadata.png! With 2 pre-allocated byte buffers (highlighted in the screen above) at 1 MByte each We have been running with 2.1.18 for > 3 years and I can't remember dealing with such OOM in the context of extending a cluster. While the MAT screens above are from our production cluster, we partly can reproduce this behavior in our loadtest environment (although not going full OOM there), thus I might be able to share a hprof from this non-prod environment if needed. Thanks a lot. was: We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have been facing an OOM two times with 3.0.18 on newly added nodes joining an existing cluster after several hours being successfully bootstrapped. Running in AWS: * m5.2xlarge, EBS SSD (gp2) * Xms/Xmx12G, Xmn3G, CMS GC * 4 compaction threads, throttling set to 32 MB/s What we see is a steady increase in the OLD gen over many hours. !cassandra_jvm_metrics.png! * The node started to join / auto-bootstrap the cluster on Oct 30 ~ 12:00 * It basically finished joining the cluster (UJ => UN) ~ 19hrs later on Oct 31 ~ 07:00 also starting to be a member of serving client read requests !cassandra_operationcount.png! Memory-wise (on-heap) it didn't look that bad at that time, but old gen usage constantly increased. We see a correlation in increased number of SSTables and pending compactions. !cassandra_sstables_pending_compactions.png! Until we reached the OOM somewhere in Nov 1 in the night. After a Cassandra startup (metric gap in the chart above), number of SSTables + pending compactions is still high, but without facing memory troubles since then. This correlation is confirmed by the auto-generated heap dump with e.g. ~ 5K BigTableReader instances with ~ 8.7GByte retained hype in total. !cassandra_hprof_dominator_classes.png! Having a closer look on a single object instance, seems like each instance is ~ 2MByte in size. 
!cassandra_hprof_bigtablereader_statsmetadata.png! With 2 pre-allocated byte buffers (highlighted in the screen above) at 1 MByte each We have been running with 2.1.18 for > 3 years and I can't remember dealing with such OOM in the context of extending a cluster. While the MAT screens above are from our production cluster, we partly can reproduce this behavior in our loadtest environment (although not going full OOM there), thus I might be able to share a hprof from this non-prod environment if needed. Thanks a lot. > Cassandra 3.0.18 went OOM several hours after joining a cluster > --- > > Key: CASSANDRA-15400 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15400 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Normal > Attachments: cassandra_hprof_bigtablereader_statsmetadata.png, > cassandra_hprof_dominator_classes.png, cassandra_jvm_metrics.png, > cassandra_operationcount.png, cassandra_sstables_pending_compactions.png > > > We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have been > facing an OOM two times with 3.0.18 on newly added nodes joining an existing > cluster after several hours being successfully bootstrapped. > Running in AWS:
[jira] [Created] (CASSANDRA-15400) Cassandra 3.0.18 went OOM several hours after joining a cluster
Thomas Steinmaurer created CASSANDRA-15400: -- Summary: Cassandra 3.0.18 went OOM several hours after joining a cluster Key: CASSANDRA-15400 URL: https://issues.apache.org/jira/browse/CASSANDRA-15400 Project: Cassandra Issue Type: Bug Reporter: Thomas Steinmaurer Attachments: cassandra_hprof_bigtablereader_statsmetadata.png, cassandra_hprof_dominator_classes.png, cassandra_jvm_metrics.png, cassandra_operationcount.png, cassandra_sstables_pending_compactions.png We have been moving from Cassandra 2.1.18 to Cassandra 3.0.18 and have been facing an OOM two times with 3.0.18 on newly added nodes joining an existing cluster after several hours being successfully bootstrapped. Running in AWS: * m5.2xlarge, EBS SSD (gp2) * Xms/Xmx12G, Xmn3G, CMS GC * 4 compaction threads, throttling set to 32 MB/s What we see is a steady increase in the OLD gen over many hours. !cassandra_jvm_metrics.png! * The node started to join / auto-bootstrap the cluster on Oct 30 ~ 12:00 * It basically finished joining the cluster (UJ => UN) ~ 19hrs later on Oct 31 ~ 07:00 also starting to be a member of serving client read requests !cassandra_operationcount.png! Memory-wise (on-heap) it didn't look that bad at that time, but old gen usage constantly increased. We see a correlation in increased number of SSTables and pending compactions. !cassandra_sstables_pending_compactions.png! Until we reached the OOM somewhere in Nov 1 in the night. After a Cassandra startup (metric gap in the chart above), number of SSTables + pending compactions is still high, but without facing memory troubles since then. This correlation is confirmed by the auto-generated heap dump with e.g. ~ 5K BigTableReader instances with ~ 8.7GByte retained hype in total. !cassandra_hprof_dominator_classes.png! Having a closer look on a single object instance, seems like each instance is ~ 2MByte in size. !cassandra_hprof_bigtablereader_statsmetadata.png! With 2 pre-allocated byte buffers (highlighted in the screen above) at 1 MByte each We have been running with 2.1.18 for > 3 years and I can't remember dealing with such OOM in the context of extending a cluster. While the MAT screens above are from our production cluster, we partly can reproduce this behavior in our loadtest environment (although not going full OOM there), thus I might be able to share a hprof from this non-prod environment if needed. Thanks a lot. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
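For reference, a sketch of how the environment described in the report (12 GByte heap, 3 GByte new gen, CMS GC, 4 compaction threads, 32 MB/s throttling) is typically expressed in the stock configuration files of that era; the exact files and keys below are assumptions based on default 2.1/3.0 installs, not taken from the reporter's actual configuration:

{noformat}
# cassandra-env.sh (values assumed to match the report)
MAX_HEAP_SIZE="12G"   # -Xms12G -Xmx12G
HEAP_NEWSIZE="3G"     # -Xmn3G
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"

# cassandra.yaml
concurrent_compactors: 4
compaction_throughput_mb_per_sec: 32
{noformat}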
[jira] [Commented] (CASSANDRA-14691) Cassandra 2.1 backport - The JVM should exit if jmx fails to bind
[ https://issues.apache.org/jira/browse/CASSANDRA-14691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610483#comment-16610483 ] Thomas Steinmaurer commented on CASSANDRA-14691: Well, sure, but how does this ticket about corruption e.g. compares to CASSANDRA-14284 included in 2.1.21 (corruption vs. crash)? Thought there might be e.g. 2.1.22 anyhow ... Anyway. I will now stop bothering. :-) > Cassandra 2.1 backport - The JVM should exit if jmx fails to bind > - > > Key: CASSANDRA-14691 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14691 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Thomas Steinmaurer >Priority: Major > Labels: lhf > Fix For: 2.1.x > > > If you are already running a cassandra instance, but for some reason try to > start another one, this happens: > {noformat} > INFO 20:57:09 JNA mlockall successful > WARN 20:57:09 JMX is not enabled to receive remote connections. Please see > cassandra-env.sh for more info. > ERROR 20:57:10 Error starting local jmx server: > java.rmi.server.ExportException: Port already in use: 7199; nested exception > is: > java.net.BindException: Address already in use > at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:340) > ~[na:1.7.0_76] > at > sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:248) > ~[na:1.7.0_76] > at > sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:411) > ~[na:1.7.0_76] > at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:147) > ~[na:1.7.0_76] > at > sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:207) > ~[na:1.7.0_76] > at sun.rmi.registry.RegistryImpl.setup(RegistryImpl.java:122) > ~[na:1.7.0_76] > at sun.rmi.registry.RegistryImpl.(RegistryImpl.java:98) > ~[na:1.7.0_76] > at > java.rmi.registry.LocateRegistry.createRegistry(LocateRegistry.java:239) > ~[na:1.7.0_76] > at > org.apache.cassandra.service.CassandraDaemon.maybeInitJmx(CassandraDaemon.java:100) > [main/:na] > at > org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:222) > [main/:na] > at > org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:564) > [main/:na] > at > org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:653) > [main/:na] > Caused by: java.net.BindException: Address already in use > at java.net.PlainSocketImpl.socketBind(Native Method) ~[na:1.7.0_76] > at > java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:376) > ~[na:1.7.0_76] > at java.net.ServerSocket.bind(ServerSocket.java:376) ~[na:1.7.0_76] > at java.net.ServerSocket.(ServerSocket.java:237) ~[na:1.7.0_76] > at > javax.net.DefaultServerSocketFactory.createServerSocket(ServerSocketFactory.java:231) > ~[na:1.7.0_76] > at > org.apache.cassandra.utils.RMIServerSocketFactoryImpl.createServerSocket(RMIServerSocketFactoryImpl.java:13) > ~[main/:na] > at > sun.rmi.transport.tcp.TCPEndpoint.newServerSocket(TCPEndpoint.java:666) > ~[na:1.7.0_76] > at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:329) > ~[na:1.7.0_76] > ... 11 common frames omitted > {noformat} > However the startup continues, and ends up replaying commitlogs, which is > probably not a good thing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
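To illustrate the behaviour the ticket asks for (exit instead of continuing into commitlog replay when the JMX port is already taken), here is a minimal standalone sketch; it is not the actual backport patch, just the same {{LocateRegistry.createRegistry}} call from the stack trace wrapped in a fail-fast check, with the port number taken from the log above:

{noformat}
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;

public class JmxBindOrDie
{
    public static void main(String[] args)
    {
        int jmxPort = 7199; // port from the error log in the description

        try
        {
            // Same call that throws ExportException in the stack trace above.
            // If another instance already owns the port, abort startup instead
            // of logging the error and continuing into commitlog replay.
            LocateRegistry.createRegistry(jmxPort);
            System.out.println("Local JMX registry bound on port " + jmxPort);
        }
        catch (RemoteException e)
        {
            System.err.println("Could not bind JMX registry on port " + jmxPort + ", exiting: " + e);
            System.exit(1);
        }
    }
}
{noformat}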
[jira] [Commented] (CASSANDRA-14691) Cassandra 2.1 backport - The JVM should exit if jmx fails to bind
[ https://issues.apache.org/jira/browse/CASSANDRA-14691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610432#comment-16610432 ] Thomas Steinmaurer commented on CASSANDRA-14691: [~spo...@gmail.com], thanks for the feedback. So, potential corruption caused by this does not qualify as critical? > Cassandra 2.1 backport - The JVM should exit if jmx fails to bind > - > > Key: CASSANDRA-14691 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14691 > Project: Cassandra > Issue Type: Bug > Components: Core >Reporter: Thomas Steinmaurer >Priority: Major > Labels: lhf > Fix For: 2.1.x > > > If you are already running a cassandra instance, but for some reason try to > start another one, this happens: > {noformat} > INFO 20:57:09 JNA mlockall successful > WARN 20:57:09 JMX is not enabled to receive remote connections. Please see > cassandra-env.sh for more info. > ERROR 20:57:10 Error starting local jmx server: > java.rmi.server.ExportException: Port already in use: 7199; nested exception > is: > java.net.BindException: Address already in use > at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:340) > ~[na:1.7.0_76] > at > sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:248) > ~[na:1.7.0_76] > at > sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:411) > ~[na:1.7.0_76] > at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:147) > ~[na:1.7.0_76] > at > sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:207) > ~[na:1.7.0_76] > at sun.rmi.registry.RegistryImpl.setup(RegistryImpl.java:122) > ~[na:1.7.0_76] > at sun.rmi.registry.RegistryImpl.(RegistryImpl.java:98) > ~[na:1.7.0_76] > at > java.rmi.registry.LocateRegistry.createRegistry(LocateRegistry.java:239) > ~[na:1.7.0_76] > at > org.apache.cassandra.service.CassandraDaemon.maybeInitJmx(CassandraDaemon.java:100) > [main/:na] > at > org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:222) > [main/:na] > at > org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:564) > [main/:na] > at > org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:653) > [main/:na] > Caused by: java.net.BindException: Address already in use > at java.net.PlainSocketImpl.socketBind(Native Method) ~[na:1.7.0_76] > at > java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:376) > ~[na:1.7.0_76] > at java.net.ServerSocket.bind(ServerSocket.java:376) ~[na:1.7.0_76] > at java.net.ServerSocket.(ServerSocket.java:237) ~[na:1.7.0_76] > at > javax.net.DefaultServerSocketFactory.createServerSocket(ServerSocketFactory.java:231) > ~[na:1.7.0_76] > at > org.apache.cassandra.utils.RMIServerSocketFactoryImpl.createServerSocket(RMIServerSocketFactoryImpl.java:13) > ~[main/:na] > at > sun.rmi.transport.tcp.TCPEndpoint.newServerSocket(TCPEndpoint.java:666) > ~[na:1.7.0_76] > at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:329) > ~[na:1.7.0_76] > ... 11 common frames omitted > {noformat} > However the startup continues, and ends up replaying commitlogs, which is > probably not a good thing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-14709) Global configuration parameter to reject increment repair and allow full repair only
[ https://issues.apache.org/jira/browse/CASSANDRA-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-14709: --- Description: We are running Cassandra in AWS and On-Premise at customer sites, currently 2.1 in production with 3.0/3.11 in pre-production stages including loadtest. In a migration path from 2.1 to 3.11.x, I’m afraid that at some point in time we end up in incremental repairs being enabled / run a first time unintentionally, cause: a) A lot of online resources / examples do not use the _-full_ command-line option available since 2.2 (?) b) Our internal (support) tickets of course also state nodetool repair command without the -full option, as these examples are for 2.1 Especially for On-Premise customers (with less control than with our AWS deployments), this risks getting a bit out of control once we have 3.11 out and nodetool repair being run without the -full command-line option. With the troubles incremental repair is introducing and incremental being the default since 2.2 (?), what do you think about a JVM system property, cassandra.yaml setting or whatever … to basically let the cluster administrator choose if incremental repairs are allowed or not? I know, such a flag still can be flipped then (by the customer), but as a first safety stage possibly sufficient enough. was: We are running Cassandra in AWS and On-Premise at customer sites, currently 2.1 in production with 3.0/3.11 in pre-production stages including loadtest. In a migration path from 2.1 to 3.11.x, I’m afraid that at some point in time we end up in incremental repairs being enabled / run a first time unintentionally, cause: a) A lot of online resources / examples do not use the -full command-line option b) Our internal (support) tickets of course also state nodetool repair command without the -full option, as these examples are for 2.1 Especially for On-Premise customers (with less control than with our AWS deployments), this risks getting a bit out of control once we have 3.11 out and nodetool repair being run without the -full command-line option. With the troubles incremental repair is introducing and incremental being the default since 2.2 (?), what do you think about a JVM system property, cassandra.yaml setting or whatever … to basically let the cluster administrator choose if incremental repairs are allowed or not? I know, such a flag still can be flipped then (by the customer), but as a first safety stage possibly sufficient enough. > Global configuration parameter to reject increment repair and allow full > repair only > > > Key: CASSANDRA-14709 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14709 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Major > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.0.x > > > We are running Cassandra in AWS and On-Premise at customer sites, currently > 2.1 in production with 3.0/3.11 in pre-production stages including loadtest. > In a migration path from 2.1 to 3.11.x, I’m afraid that at some point in time > we end up in incremental repairs being enabled / run a first time > unintentionally, cause: > a) A lot of online resources / examples do not use the _-full_ command-line > option available since 2.2 (?)
> b) Our internal (support) tickets of course also state nodetool repair > command without the -full option, as these examples are for 2.1 > Especially for On-Premise customers (with less control than with our AWS > deployments), this risks getting a bit out of control once we have 3.11 > out and nodetool repair being run without the -full command-line option. > With the troubles incremental repair is introducing and incremental being the > default since 2.2 (?), what do you think about a JVM system property, > cassandra.yaml setting or whatever … to basically let the cluster > administrator choose if incremental repairs are allowed or not? I know, such a > flag still can be flipped then (by the customer), but as a first safety stage > possibly sufficient enough. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
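For comparison, the two command forms the description is concerned about; the keyspace name below is only a placeholder, and the "incremental by default since 2.2" framing follows the description above:

{noformat}
# Explicit full repair (the -full option exists since 2.2):
nodetool repair -full my_keyspace

# Without -full, 2.2+ runs an incremental repair by default:
nodetool repair my_keyspace
{noformat}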
[jira] [Updated] (CASSANDRA-14709) Global configuration parameter to reject increment repair and allow full repair only
[ https://issues.apache.org/jira/browse/CASSANDRA-14709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Steinmaurer updated CASSANDRA-14709: --- Description: We are running Cassandra in AWS and On-Premise at customer sites, currently 2.1 in production with 3.0/3.11 in pre-production stages including loadtest. In a migration path from 2.1 to 3.11.x, I’m afraid that at some point in time we end up in incremental repairs being enabled / run a first time unintentionally, cause: a) A lot of online resources / examples do not use the -full command-line option b) Our internal (support) tickets of course also state nodetool repair command without the -full option, as these examples are for 2.1 Especially for On-Premise customers (with less control than with our AWS deployments), this risks getting a bit out of control once we have 3.11 out and nodetool repair being run without the -full command-line option. With the troubles incremental repair is introducing and incremental being the default since 2.2 (?), what do you think about a JVM system property, cassandra.yaml setting or whatever … to basically let the cluster administrator choose if incremental repairs are allowed or not? I know, such a flag still can be flipped then (by the customer), but as a first safety stage possibly sufficient enough. was: We are running Cassandra in AWS and On-Premise at customer sites, currently 2.1 in production with 3.0/3.11 in pre-production stages including loadtest. In a migration path from 2.1 to 3.11.x, I’m afraid that at some point in time we end up in incremental repairs being enabled / run a first time unintentionally, cause: a) A lot of online resources / examples do not use the -full command-line option b) Our internal (support) tickets of course also state nodetool repair command without the -full option, as these are for 2.1 Especially for On-Premise customers (with less control than with our AWS deployments), this risks getting a bit out of control once we have 3.11 out and nodetool repair being run without the -full command-line option. With the troubles incremental repair is introducing and incremental being the default since 2.2 (?), what do you think about a JVM system property, cassandra.yaml setting or whatever … to basically let the cluster administrator choose if incremental repairs are allowed or not? I know, such a flag still can be flipped then (by the customer), but as a first safety stage possibly sufficient enough. > Global configuration parameter to reject increment repair and allow full > repair only > > > Key: CASSANDRA-14709 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14709 > Project: Cassandra > Issue Type: Bug >Reporter: Thomas Steinmaurer >Priority: Major > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.0.x > > > We are running Cassandra in AWS and On-Premise at customer sites, currently > 2.1 in production with 3.0/3.11 in pre-production stages including loadtest.
> In a migration path from 2.1 to 3.11.x, I’m afraid that at some point in time > we end up in incremental repairs being enabled / run a first time > unintentionally, cause: > a) A lot of online resources / examples do not use the -full command-line > option > b) Our internal (support) tickets of course also state nodetool repair > command without the -full option, as these examples are for 2.1 > Especially for On-Premise customers (with less control than with our AWS > deployments), this risks getting a bit out of control once we have 3.11 > out and nodetool repair being run without the -full command-line option. > With the troubles incremental repair is introducing and incremental being the > default since 2.2 (?), what do you think about a JVM system property, > cassandra.yaml setting or whatever … to basically let the cluster > administrator choose if incremental repairs are allowed or not? I know, such a > flag still can be flipped then (by the customer), but as a first safety stage > possibly sufficient enough. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
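To make the proposal concrete, a purely hypothetical sketch of such a guard; neither the cassandra.allow_incremental_repair property name nor this class exists in Cassandra, it only illustrates the kind of operator-controlled switch the ticket asks for:

{noformat}
// Hypothetical guard; the property name and class are made up for illustration only.
public final class RepairGuard
{
    private RepairGuard() {}

    // Would be called before a repair session starts.
    public static void checkIncrementalAllowed(boolean incrementalRequested)
    {
        boolean allowed = Boolean.parseBoolean(
                System.getProperty("cassandra.allow_incremental_repair", "true"));
        if (incrementalRequested && !allowed)
            throw new IllegalArgumentException(
                    "Incremental repair is disabled on this cluster; run 'nodetool repair -full' instead");
    }
}
{noformat}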