[jira] [Updated] (CASSANDRA-9739) Migrate counter-cache to be fully off-heap
[ https://issues.apache.org/jira/browse/CASSANDRA-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Stupp updated CASSANDRA-9739: Resolution: Won't Fix Status: Resolved (was: Open) > Migrate counter-cache to be fully off-heap > -- > > Key: CASSANDRA-9739 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9739 > Project: Cassandra > Issue Type: Sub-task > Components: Legacy/Core >Reporter: Robert Stupp >Assignee: Robert Stupp >Priority: Normal > Fix For: 4.x > > > Counter cache still uses a concurrent map on-heap. This could move off-heap > and feels doable now after CASSANDRA-8099. > Evaluation should be done in advance, based on a POC, to prove that a pure > off-heap counter cache buys a performance and/or GC-pressure improvement. > In theory, eliminating on-heap management of the map should buy us some > benefit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15803) Separate out allow filtering scanning through a partition versus scanning over the table
[ https://issues.apache.org/jira/browse/CASSANDRA-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Hanna updated CASSANDRA-15803: - Description: Currently allow filtering can mean two things in the spirit of "avoid operations that don't seek to a specific row or sequential rows of data." First, it can mean scanning across the entire table to meet the criteria of the query. That's almost always a bad thing and should be discouraged or disabled (see CASSANDRA-8303). Second, it can mean filtering within a specific partition. For example, in a query you could specify the full partition key, and if you specify a criterion on a non-key field, it requires allow filtering. The second case that requires allow filtering involves significantly less work, since it only scans through a single partition. It is still extra work over seeking to a specific row and getting N sequential rows, though. So while an application developer and/or operator needs to be cautious about this second type, it's not necessarily a bad thing, depending on the table and the use case. I propose that we separate the way to specify allow filtering across an entire table from specifying allow filtering across a partition in a backwards compatible way. One idea that was brought up in Slack in the cassandra-dev room was to have allow filtering mean the superset - scanning across the table. Then if you want to specify that you *only* want to scan within a partition you would use something like {{ALLOW FILTERING [WITHIN PARTITION]}}. It will then succeed if you specify non-key criteria within a single partition, but fail with a message saying it requires the full allow filtering. This would allow for a backwards compatible full allow filtering while allowing a user to specify that they want to just scan within a partition, but error out if trying to scan a full table. This is potentially also related to the capability limitation framework by which operators could more granularly specify what features are allowed or disallowed per user, discussed in CASSANDRA-8303. This way an operator could disallow the more general allow filtering while allowing the partition scan (or disallow them both at their discretion). was: Currently allow filtering can mean two things in the spirit of "avoid operations that don't seek to a specific row or sequential rows of data." First, it can mean scanning across the entire table to meet the criteria of the query. That's almost always a bad thing and should be discouraged or disabled (see CASSANDRA-8303). Second, it can mean filtering within a specific partition. For example, in a query you could specify the full partition key and if you specify a criterion on a non-key field, it requires allow filtering. The second reason to require allow filtering is significantly less work to scan through a partition. It is still extra work over seeking to a specific row and getting N sequential rows though. So while an application developer and/or operator needs to be cautious about this second type, it's not necessarily a bad thing, depending on the table and the use case. I propose that we separate the way to specify allow filtering across an entire table (involving a scatter gather) from specifying allow filtering across a partition in a backwards compatible way. One idea that was brought up in Slack in the cassandra-dev room was to have allow filtering mean the superset - scanning across the table.
Then if you want to specify that you *only* want to scan within a partition you would use something like {{ALLOW FILTERING [WITHIN PARTITION]}} So it will succeed if you specify non-key criteria within a single partition, but fail with a message to say it requires the full allow filtering. This would allow for a backwards compatible full allow filtering while allowing a user to specify that they want to just scan within a partition, but error out if trying to scan a full table. This is potentially also related to the capability limitation framework by which operators could more granularly specify what features are allowed or disallowed per user, discussed in CASSANDRA-8303. This way an operator could disallow the more general allow filtering while allowing the partition scan (or disallow them both at their discretion). > Separate out allow filtering scanning through a partition versus scanning > over the table > > > Key: CASSANDRA-15803 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15803 > Project: Cassandra > Issue Type: Improvement > Components: CQL/Syntax >Reporter: Jeremy Hanna >Priority: Normal > > Currently allow filtering can mean two things in the spirit of "avoid > operations that don't seek to a specific row or sequ
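To make the two cases concrete, here is an illustrative CQL sketch (the schema is hypothetical, and the {{WITHIN PARTITION}} syntax is only the proposal above, not an implemented feature):

{code:sql}
-- Hypothetical table: partition key sensor_id, clustering key reading_time.
CREATE TABLE sensor_readings (
    sensor_id int,
    reading_time timestamp,
    value double,
    PRIMARY KEY (sensor_id, reading_time)
);

-- Case 1: no partition key restriction, so the whole table is scanned (scatter-gather).
SELECT * FROM sensor_readings WHERE value > 10 ALLOW FILTERING;

-- Case 2: partition key specified, so filtering is confined to one partition.
SELECT * FROM sensor_readings WHERE sensor_id = 42 AND value > 10 ALLOW FILTERING;

-- Proposed syntax: would succeed for case 2, but error out for case 1.
SELECT * FROM sensor_readings WHERE sensor_id = 42 AND value > 10 ALLOW FILTERING WITHIN PARTITION;
{code}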
[jira] [Comment Edited] (CASSANDRA-13701) Lower default num_tokens
[ https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152460#comment-17152460 ] Jeremy Hanna edited comment on CASSANDRA-13701 at 7/7/20, 3:28 AM: --- Can we also standardize the tests to use the default values - that is, from 32 to the new defaults (16 {{num_tokens}} with {{allocate_tokens_for_local_replication_factor=3}} uncommented). was (Author: jeromatron): Can we also standardize the tests to use the default values - that is, from 32 to the new defaults (16 {{num_tokens}} with {{allocate_tokens_for_local_replication_factor=3}} uncommented. > Lower default num_tokens > > > Key: CASSANDRA-13701 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13701 > Project: Cassandra > Issue Type: Improvement > Components: Local/Config >Reporter: Chris Lohfink >Assignee: Jeremy Hanna >Priority: Low > > For reasons highlighted in CASSANDRA-7032, the high number of vnodes is not > necessary. It is very expensive for operations processes and scanning. It's > come up a lot, and it's now standard and well known within the community to > always reduce num_tokens. We should just lower the defaults. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
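For reference, the defaults under discussion would look roughly like this in cassandra.yaml (a sketch based on the comment above; the surrounding comments in the shipped file may differ):

{code}
# Lowered from the long-standing default of 256 (tests had been using 32)
num_tokens: 16

# Uncommented so token allocation is optimized for the local replication factor
allocate_tokens_for_local_replication_factor: 3
{code}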
[jira] [Commented] (CASSANDRA-13701) Lower default num_tokens
[ https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152460#comment-17152460 ] Jeremy Hanna commented on CASSANDRA-13701: -- Can we also standardize the tests to use the default values - that is, from 32 to the new defaults (16 {{num_tokens}} with {{allocate_tokens_for_local_replication_factor=3}} uncommented. > Lower default num_tokens > > > Key: CASSANDRA-13701 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13701 > Project: Cassandra > Issue Type: Improvement > Components: Local/Config >Reporter: Chris Lohfink >Assignee: Jeremy Hanna >Priority: Low > > For reasons highlighted in CASSANDRA-7032, the high number of vnodes is not > necessary. It is very expensive for operations processes and scanning. It's > come up a lot, and it's now standard and well known within the community to > always reduce num_tokens. We should just lower the defaults. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-13701) Lower default num_tokens
[ https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Hanna updated CASSANDRA-13701: - Test and Documentation Plan: Associated documentation about num_tokens is in [https://cassandra.apache.org/doc/latest/getting_started/production.html#tokens] as part of CASSANDRA-15618 as well as upgrading information in NEWS.txt. Status: Patch Available (was: In Progress) Pull request: https://github.com/apache/cassandra/pull/663 > Lower default num_tokens > > > Key: CASSANDRA-13701 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13701 > Project: Cassandra > Issue Type: Improvement > Components: Local/Config >Reporter: Chris Lohfink >Assignee: Jeremy Hanna >Priority: Low > > For reasons highlighted in CASSANDRA-7032, the high number of vnodes is not > necessary. It is very expensive for operations processes and scanning. It's > come up a lot, and it's now standard and well known within the community to > always reduce num_tokens. We should just lower the defaults. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15894) JAVA 8: test_multiple_repair - repair_tests.incremental_repair_test.TestIncRepair
[ https://issues.apache.org/jira/browse/CASSANDRA-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152418#comment-17152418 ] Ekaterina Dimitrova edited comment on CASSANDRA-15894 at 7/7/20, 2:04 AM: -- I think REPAIR was improved recently, as I saw some related tickets closed. Now I managed to run this test successfully 100 times on the latest trunk (when I opened this ticket, it was consistently failing on my computer). But I observed in the log (attached to this ticket) that from time to time the test succeeds only after a second run; the first time it times out again. [~blerer], as the one assigned to apply expertise to CASSANDRA-15580, what is your advice: should we look further into this ticket/particular test at this moment? Thank you in advance :) PS: I just tried to find out the reasoning behind the setup of running the test two times before it is considered failed. Unfortunately, these tests were at some point moved from a different script to incremental_repair_test.py, and some of the git history was lost when the old script was deleted. was (Author: e.dimitrova): I think REPAIR was improved recently as I think I saw some related tickets closed. Now I managed to run this test successfully 100 times on the latest trunk. (when I opened this ticket, It was consistently failing on my computer). But I observed in the log (Attached to this ticket) that from time to time the test succeeds only after second run. The first time it times out again. [~blerer] , as the one assigned to apply expertise to CASSANDRA-15580, what is your advice, should we check further this ticket/particular test at this moment? Thank you in advance :) PS I just tried to find out the reasoning behind the setup of running the test two times before it is considered failed. Unfortunately, back in time these tests were moved from a different script to incremental_repair_test.py and at that point some of the git historical data is lost as. the old script was deleted. > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > - > > Key: CASSANDRA-15894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15894 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Berenguer Blasi >Priority: Normal > Fix For: 4.0-rc > > Attachments: test_multiple_repair_log.txt > > > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > Fails locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/46515d14-9be4-4edb-8db4-5930312d2bfb/jobs/1329 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15894) JAVA 8: test_multiple_repair - repair_tests.incremental_repair_test.TestIncRepair
[ https://issues.apache.org/jira/browse/CASSANDRA-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152418#comment-17152418 ] Ekaterina Dimitrova edited comment on CASSANDRA-15894 at 7/7/20, 2:03 AM: -- I think REPAIR was improved recently as I think I saw some related tickets closed. Now I managed to run this test successfully 100 times on the latest trunk. (when I opened this ticket, It was consistently failing on my computer). But I observed in the log (Attached to this ticket) that from time to time the test succeeds only after second run. The first time it times out again. [~blerer] , as the one assigned to apply expertise to CASSANDRA-15580, what is your advice, should we check further this ticket/particular test at this moment? Thank you in advance :) PS I just tried to find out the reasoning behind the setup of running the test two times before it is considered failed. Unfortunately, back in time these tests were moved from a different script to incremental_repair_test.py and at that point some of the git historical data is lost as. the old script was deleted. was (Author: e.dimitrova): I think REPAIR was improved recently as I think I saw some related tickets closed. Now I managed to run this test successfully 100 times. (when I opened this ticket, It was consistently failing on my computer). But I observed in the log (Attached to this ticket) that from time to time the test succeeds only after second run. The first time it times out again. [~blerer] , as the one assigned to apply expertise to CASSANDRA-15580, what is your advice, should we check further this ticket/particular test at this moment? Thank you in advance :) PS I just tried to find out the reasoning behind the setup of running the test two times before it is considered failed. Unfortunately, back in time these tests were moved from a different script to incremental_repair_test.py and at that point some of the git historical data is lost as. the old script was deleted. > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > - > > Key: CASSANDRA-15894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15894 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Berenguer Blasi >Priority: Normal > Fix For: 4.0-rc > > Attachments: test_multiple_repair_log.txt > > > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > Fails locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/46515d14-9be4-4edb-8db4-5930312d2bfb/jobs/1329 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15894) JAVA 8: test_multiple_repair - repair_tests.incremental_repair_test.TestIncRepair
[ https://issues.apache.org/jira/browse/CASSANDRA-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ekaterina Dimitrova updated CASSANDRA-15894: Discovered By: User Report (was: Unit Test) > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > - > > Key: CASSANDRA-15894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15894 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Berenguer Blasi >Priority: Normal > Fix For: 4.0-rc > > Attachments: test_multiple_repair_log.txt > > > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > Fails locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/46515d14-9be4-4edb-8db4-5930312d2bfb/jobs/1329 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15894) JAVA 8: test_multiple_repair - repair_tests.incremental_repair_test.TestIncRepair
[ https://issues.apache.org/jira/browse/CASSANDRA-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ekaterina Dimitrova updated CASSANDRA-15894: Attachment: test_multiple_repair_log.txt > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > - > > Key: CASSANDRA-15894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15894 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Berenguer Blasi >Priority: Normal > Fix For: 4.0-rc > > Attachments: test_multiple_repair_log.txt > > > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > Fails locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/46515d14-9be4-4edb-8db4-5930312d2bfb/jobs/1329 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15894) JAVA 8: test_multiple_repair - repair_tests.incremental_repair_test.TestIncRepair
[ https://issues.apache.org/jira/browse/CASSANDRA-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152418#comment-17152418 ] Ekaterina Dimitrova edited comment on CASSANDRA-15894 at 7/7/20, 1:59 AM: -- I think REPAIR was improved recently as I think I saw some related tickets closed. Now I managed to run this test successfully 100 times. (when I opened this ticket, It was consistently failing on my computer). But I observed in the log (Attached to this ticket) that from time to time the test succeeds only after second run. The first time it times out again. [~blerer] , as the one assigned to apply expertise to CASSANDRA-15580, what is your advice, should we check further this ticket/particular test at this moment? Thank you in advance :) PS I just tried to find out the reasoning behind the setup of running the test two times before it is considered failed. Unfortunately, back in time these tests were moved from a different script to incremental_repair_test.py and at that point some of the git historical data is lost as. the old script was deleted. was (Author: e.dimitrova): I think REPAIR was improved recently as I think I saw some related tickets closed. Now I managed to run this test successfully 100 times. (when I opened this ticket, It was consistently failing on my computer). But I observed in the log (Attached to this ticket) that from time to time the test succeeds only after second run. The first time it times out again. [~blerer] , as the one assigned to apply expertise to CASSANDRA-15580, what is your advice, should we check further this ticket/particular test at this moment? Thank you in advance :) PS I just tried to find out the reasoning behind the setup of running the test two times before it is considered failed. Unfortunately, back in time these tests were moved from a different script to incremental_repair_test.py and at that point some of the git historical data is lost. > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > - > > Key: CASSANDRA-15894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15894 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Berenguer Blasi >Priority: Normal > Fix For: 4.0-rc > > > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > Fails locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/46515d14-9be4-4edb-8db4-5930312d2bfb/jobs/1329 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15894) JAVA 8: test_multiple_repair - repair_tests.incremental_repair_test.TestIncRepair
[ https://issues.apache.org/jira/browse/CASSANDRA-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152418#comment-17152418 ] Ekaterina Dimitrova edited comment on CASSANDRA-15894 at 7/7/20, 1:59 AM: -- I think REPAIR was improved recently as I think I saw some related tickets closed. Now I managed to run this test successfully 100 times. (when I opened this ticket, It was consistently failing on my computer). But I observed in the log (Attached to this ticket) that from time to time the test succeeds only after second run. The first time it times out again. [~blerer] , as the one assigned to apply expertise to CASSANDRA-15580, what is your advice, should we check further this ticket/particular test at this moment? Thank you in advance :) PS I just tried to find out the reasoning behind the setup of running the test two times before it is considered failed. Unfortunately, back in time these tests were moved from a different script to incremental_repair_test.py and at that point some of the git historical data is lost. was (Author: e.dimitrova): I think REPAIR was improved recently as I think I saw some related tickets closed. Now I managed to run this test successfully 100 times. (when I opened this ticket, It was consistently failing on my computer). But I observed in the log (Attached to this ticket) that from time to time the test succeeds only after second run. The first time it times out again. [~blerer] , as the one assigned to apply expertise to CASSANDRA-15580, what is your advice, should we check further this ticket/particular test at this moment? Thank you in advance :) PS I just tried to find out the reasoning between the setup of running the test two times before it is considered failed. Unfortunately, back in time these tests were moved from a different script to incremental_repair_test.py and at that point some of the git historical data is lost. > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > - > > Key: CASSANDRA-15894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15894 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Berenguer Blasi >Priority: Normal > Fix For: 4.0-rc > > > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > Fails locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/46515d14-9be4-4edb-8db4-5930312d2bfb/jobs/1329 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15894) JAVA 8: test_multiple_repair - repair_tests.incremental_repair_test.TestIncRepair
[ https://issues.apache.org/jira/browse/CASSANDRA-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152418#comment-17152418 ] Ekaterina Dimitrova commented on CASSANDRA-15894: - I think REPAIR was improved recently as I think I saw some related tickets closed. Now I managed to run this test successfully 100 times. (when I opened this ticket, It was consistently failing on my computer). But I observed in the log (Attached to this ticket) that from time to time the test succeeds only after second run. The first time it times out again. [~blerer] , as the one assigned to apply expertise to CASSANDRA-15580, what is your advice, should we check further this ticket/particular test at this moment? Thank you in advance :) PS I just tried to find out the reasoning between the setup of running the test two times before it is considered failed. Unfortunately, back in time these tests were moved from a different script to incremental_repair_test.py and at that point some of the git historical data is lost. > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > - > > Key: CASSANDRA-15894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15894 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Berenguer Blasi >Priority: Normal > Fix For: 4.0-rc > > > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > Fails locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/46515d14-9be4-4edb-8db4-5930312d2bfb/jobs/1329 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15592) IllegalStateException in gossip after removing node
[ https://issues.apache.org/jira/browse/CASSANDRA-15592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152387#comment-17152387 ] Jai Bheemsen Rao Dhanwada commented on CASSANDRA-15592: --- Hello [~brandon.williams], I ran into a similar exception. Is there any impact from this ERROR, or is it just more of a logging problem? In my tests I didn't see any impact to the cluster operations, so I would like to know the impact of this before attempting to upgrade in production. > IllegalStateException in gossip after removing node > --- > > Key: CASSANDRA-15592 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15592 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Gossip >Reporter: Marcus Olsson >Assignee: Marcus Olsson >Priority: Normal > Fix For: 3.0.21, 3.11.7, 4.0, 4.0-alpha4 > > > In one of our test environments we encountered the following exception: > {noformat} > 2020-02-02T10:50:13.276+0100 [GossipTasks:1] ERROR > o.a.c.u.NoSpamLogger$NoSpamLogStatement:97 log > java.lang.IllegalStateException: Attempting gossip state mutation from > illegal thread: GossipTasks:1 > at > org.apache.cassandra.gms.Gossiper.checkProperThreadForStateMutation(Gossiper.java:178) > at org.apache.cassandra.gms.Gossiper.evictFromMembership(Gossiper.java:465) > at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:895) > at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:78) > at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:240) > at > org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at > org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.lang.Thread.run(Thread.java:748) > java.lang.IllegalStateException: Attempting gossip state mutation from > illegal thread: GossipTasks:1 > at > org.apache.cassandra.gms.Gossiper.checkProperThreadForStateMutation(Gossiper.java:178) > [apache-cassandra-3.11.5.jar:3.11.5] > at org.apache.cassandra.gms.Gossiper.evictFromMembership(Gossiper.java:465) > [apache-cassandra-3.11.5.jar:3.11.5] > at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:895) > [apache-cassandra-3.11.5.jar:3.11.5] > at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:78) > [apache-cassandra-3.11.5.jar:3.11.5] > at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:240) > [apache-cassandra-3.11.5.jar:3.11.5] > at > org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118) > [apache-cassandra-3.11.5.jar:3.11.5] > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > [na:1.8.0_231] > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > [na:1.8.0_231] > 
at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > [na:1.8.0_231] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > [na:1.8.0_231] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [na:1.8.0_231] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [na:1.8.0_231] > at > org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) > [apache-cassandra-3.11.5.jar:3.11.5] > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > ~[netty-all-4.1.42.Final.jar:4.1.42.Final] > at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_231] > {noformat} > Since CASSANDRA-15059 we check that all state changes are performed in the > GossipStage but it seems like it was still performed in the "current" thread > [here|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/gms/Gossiper.java#L895]. > It should be as simp
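The quoted description is cut off, but it points at moving the mutation onto the gossip stage instead of the GossipTasks timer thread. Here is a self-contained Java sketch of that pattern; the executor and names below are illustrative only, not Cassandra's actual internals:

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class GossipStageSketch
{
    // A single-threaded executor standing in for Cassandra's gossip stage.
    private static final ExecutorService GOSSIP_STAGE =
        Executors.newSingleThreadExecutor(r -> new Thread(r, "GossipStage:1"));

    private static void checkProperThreadForStateMutation()
    {
        // Mirrors the check that produced the exception in the report above.
        if (!Thread.currentThread().getName().startsWith("GossipStage"))
            throw new IllegalStateException("Attempting gossip state mutation from illegal thread: "
                                            + Thread.currentThread().getName());
    }

    private static void evictFromMembership(String endpoint)
    {
        checkProperThreadForStateMutation(); // passes, because we are on the gossip stage
        System.out.println("evicted " + endpoint + " on " + Thread.currentThread().getName());
    }

    public static void main(String[] args)
    {
        // The status check runs on a timer thread; instead of calling
        // evictFromMembership directly (which would throw), submit it to the stage.
        GOSSIP_STAGE.execute(() -> evictFromMembership("10.0.0.1"));
        GOSSIP_STAGE.shutdown();
    }
}
{code}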
[jira] [Updated] (CASSANDRA-15900) Close channel and reduce buffer allocation during entire sstable streaming with SSL
[ https://issues.apache.org/jira/browse/CASSANDRA-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dinesh Joshi updated CASSANDRA-15900: - Since Version: 4.0-alpha1 Source Control Link: https://github.com/apache/cassandra/commit/73691944c0ff9b01679cf5a6fe5944ad4c416509 Resolution: Fixed Status: Resolved (was: Ready to Commit) Committed. Thanks, [~maedhroz] and [~jasonstack]! > Close channel and reduce buffer allocation during entire sstable streaming > with SSL > --- > > Key: CASSANDRA-15900 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15900 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: ZhaoYang >Assignee: ZhaoYang >Priority: Normal > Fix For: 4.0-beta > > > CASSANDRA-15740 added the ability to stream entire sstables by loading the > on-disk file into a user-space off-heap buffer when SSL is enabled, because > netty doesn't support zero-copy with SSL. > But there are two issues: > # the file channel is not closed. > # a 1mb batch size is used. 1mb exceeds the buffer pool's max allocation > size, thus it's all allocated outside the pool and will cause a large number > of allocations. > [Patch|https://github.com/apache/cassandra/pull/651]: > # close the file channel when the last batch is loaded into the off-heap > bytebuffer. I don't think we need to wait until the buffer is flushed by > netty. > # reduce the batch to 64kb, which is more buffer-pool friendly when > streaming entire sstables with SSL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15900) Close channel and reduce buffer allocation during entire sstable streaming with SSL
[ https://issues.apache.org/jira/browse/CASSANDRA-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dinesh Joshi updated CASSANDRA-15900: - Reviewers: Caleb Rackliffe, Dinesh Joshi (was: Caleb Rackliffe) > Close channel and reduce buffer allocation during entire sstable streaming > with SSL > --- > > Key: CASSANDRA-15900 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15900 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: ZhaoYang >Assignee: ZhaoYang >Priority: Normal > Fix For: 4.0-beta > > > CASSANDRA-15740 added the ability to stream entire sstables by loading the > on-disk file into a user-space off-heap buffer when SSL is enabled, because > netty doesn't support zero-copy with SSL. > But there are two issues: > # the file channel is not closed. > # a 1mb batch size is used. 1mb exceeds the buffer pool's max allocation > size, thus it's all allocated outside the pool and will cause a large number > of allocations. > [Patch|https://github.com/apache/cassandra/pull/651]: > # close the file channel when the last batch is loaded into the off-heap > bytebuffer. I don't think we need to wait until the buffer is flushed by > netty. > # reduce the batch to 64kb, which is more buffer-pool friendly when > streaming entire sstables with SSL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[cassandra] branch trunk updated: Close channel and reduce buffer allocation during entire sstable streaming with SSL
This is an automated email from the ASF dual-hosted git repository.

djoshi pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra.git

The following commit(s) were added to refs/heads/trunk by this push:
     new 7369194  Close channel and reduce buffer allocation during entire sstable streaming with SSL
7369194 is described below

commit 73691944c0ff9b01679cf5a6fe5944ad4c416509
Author: Zhao Yang
AuthorDate: Wed Jun 24 18:37:47 2020 +0800

    Close channel and reduce buffer allocation during entire sstable streaming with SSL

    Patch by Zhao Yang; Reviewed by Caleb Rackliffe and Dinesh Joshi for CASSANDRA-15900
---
 CHANGES.txt                                        |  1 +
 .../cassandra/net/AsyncStreamingOutputPlus.java    | 66 ++
 .../net/AsyncStreamingOutputPlusTest.java          | 58 +++
 .../unit/org/apache/cassandra/net/TestChannel.java |  4 +-
 4 files changed, 101 insertions(+), 28 deletions(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index 8fafb7d..c3fdf4f 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 4.0-alpha5
+ * Close channel and reduce buffer allocation during entire sstable streaming with SSL (CASSANDRA-15900)
 * Prune expired messages less frequently in internode messaging (CASSANDRA-15700)
 * Fix Ec2Snitch handling of legacy mode for dc names matching both formats, eg "us-west-2" (CASSANDRA-15878)
 * Add support for server side DESCRIBE statements (CASSANDRA-14825)
diff --git a/src/java/org/apache/cassandra/net/AsyncStreamingOutputPlus.java b/src/java/org/apache/cassandra/net/AsyncStreamingOutputPlus.java
index e685584..680a9d3 100644
--- a/src/java/org/apache/cassandra/net/AsyncStreamingOutputPlus.java
+++ b/src/java/org/apache/cassandra/net/AsyncStreamingOutputPlus.java
@@ -23,11 +23,13 @@ import java.nio.ByteBuffer;
 import java.nio.channels.ClosedChannelException;
 import java.nio.channels.FileChannel;

+import com.google.common.annotations.VisibleForTesting;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

 import io.netty.channel.Channel;
 import io.netty.channel.ChannelPromise;
+import io.netty.channel.FileRegion;
 import io.netty.channel.WriteBufferWaterMark;
 import io.netty.handler.ssl.SslHandler;
 import org.apache.cassandra.io.compress.BufferType;
@@ -161,51 +163,65 @@ public class AsyncStreamingOutputPlus extends AsyncChannelOutputPlus
     }

     /**
+     * Writes all data in file channel to stream:
+     * * For zero-copy-streaming, 1MiB at a time, with at most 2MiB in flight at once.
+     * * For streaming with SSL, 64kb at a time, with at most 32+64kb (default low water mark + batch size) in flight.
      *
-     * Writes all data in file channel to stream, 1MiB at a time, with at most 2MiB in flight at once.
-     * This method takes ownership of the provided {@code FileChannel}.
+     * This method takes ownership of the provided {@link FileChannel}.
      *
      * WARNING: this method blocks only for permission to write to the netty channel; it exits before
-     * the write is flushed to the network.
+     * the {@link FileRegion}(zero-copy) or {@link ByteBuffer}(ssl) is flushed to the network.
      */
     public long writeFileToChannel(FileChannel file, StreamRateLimiter limiter) throws IOException
     {
-        // write files in 1MiB chunks, since there may be blocking work performed to fetch it from disk,
-        // the data is never brought in process and is gated by the wire anyway
         if (channel.pipeline().get(SslHandler.class) != null)
-            return writeFileToChannel(file, limiter, 1 << 20, 1 << 20, 2 << 20);
+            // each batch is loaded into ByteBuffer, 64kb is more BufferPool friendly.
+            return writeFileToChannel(file, limiter, 1 << 16);
         else
+            // write files in 1MiB chunks, since there may be blocking work performed to fetch it from disk,
+            // the data is never brought in process and is gated by the wire anyway
            return writeFileToChannelZeroCopy(file, limiter, 1 << 20, 1 << 20, 2 << 20);
     }

-    public long writeFileToChannel(FileChannel fc, StreamRateLimiter limiter, int batchSize, int lowWaterMark, int highWaterMark) throws IOException
+    @VisibleForTesting
+    long writeFileToChannel(FileChannel fc, StreamRateLimiter limiter, int batchSize) throws IOException
     {
         final long length = fc.size();
         long bytesTransferred = 0;
-        while (bytesTransferred < length)
+
+        try
+        {
+            while (bytesTransferred < length)
+            {
+                int toWrite = (int) min(batchSize, length - bytesTransferred);
+                final long position = bytesTransferred;
+
+                writeToChannel(bufferSupplier -> {
+                    ByteBuffer outBuffer = bufferSupplier.get(toWrite);
+                    long read = fc.read(outBuffer, position);
+                    if (read != toWri
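The diff above is cut off mid-hunk in the archive. As a self-contained illustration of the batching pattern it implements (reading a file in fixed 64 KiB chunks so each buffer stays within a pool-friendly allocation size, and closing the channel when the last batch is read), here is an editor's sketch in plain NIO; it is not the actual Cassandra code, which hands each buffer to netty and applies rate limiting:

{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChunkedFileRead
{
    static final int BATCH_SIZE = 1 << 16; // 64 KiB, the SSL-path batch size

    public static long transfer(String path) throws IOException
    {
        long transferred = 0;
        try (FileChannel fc = FileChannel.open(Paths.get(path), StandardOpenOption.READ))
        {
            long length = fc.size();
            while (transferred < length)
            {
                int toRead = (int) Math.min(BATCH_SIZE, length - transferred);
                ByteBuffer buffer = ByteBuffer.allocateDirect(toRead);
                long read = fc.read(buffer, transferred);
                // the real code also treats a short read as an error
                if (read != toRead)
                    throw new IOException("could not read " + toRead + " bytes at " + transferred);
                buffer.flip();
                // ... hand the buffer to the network layer here ...
                transferred += read;
            }
        } // try-with-resources closes the channel, the first bug fixed by the patch
        return transferred;
    }
}
{code}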
[jira] [Comment Edited] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest
[ https://issues.apache.org/jira/browse/CASSANDRA-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152379#comment-17152379 ] Ekaterina Dimitrova edited comment on CASSANDRA-15685 at 7/7/20, 12:10 AM: --- Back to this work. [~blerer], [~bdeggleston], [~marcuse], may I ask for your expert advice? Is fixing the test enough here, or should [this behavior | https://issues.apache.org/jira/browse/CASSANDRA-15685?focusedCommentId=17121396&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121396] also be considered? Thank you in advance :) was (Author: e.dimitrova): Back to this work. [~blerer], [~bdeggleston], [~marcuse], may I ask for your expert advice? Is fixing the test enough here or this behavior should also be considered? Thank you in advance :) > flaky testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > -- > > Key: CASSANDRA-15685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15685 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Assignee: Ekaterina Dimitrova >Priority: Normal > Labels: pull-request-available > Fix For: 4.0-beta > > Attachments: log-CASSANDRA-15685.txt, output > > Time Spent: 10m > Remaining Estimate: 0h > > Observed in: > https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests > Failure: > {noformat} > testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97) > {noformat} > [Circle > CI|https://circleci.com/gh/dcapwell/cassandra/tree/bug%2FCASSANDRA-15685] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest
[ https://issues.apache.org/jira/browse/CASSANDRA-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152379#comment-17152379 ] Ekaterina Dimitrova commented on CASSANDRA-15685: - Back to this work. [~blerer], [~bdeggleston], [~marcuse], may I ask for your expert advice? Is fixing the test enough here or this behavior should also be considered? Thank you in advance :) > flaky testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > -- > > Key: CASSANDRA-15685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15685 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Assignee: Ekaterina Dimitrova >Priority: Normal > Labels: pull-request-available > Fix For: 4.0-beta > > Attachments: log-CASSANDRA-15685.txt, output > > Time Spent: 10m > Remaining Estimate: 0h > > Observed in: > https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests > Failure: > {noformat} > testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97) > {noformat} > [Circle > CI|https://circleci.com/gh/dcapwell/cassandra/tree/bug%2FCASSANDRA-15685] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Issue Comment Deleted] (CASSANDRA-8675) COPY TO/FROM broken for newline characters
[ https://issues.apache.org/jira/browse/CASSANDRA-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jai Bheemsen Rao Dhanwada updated CASSANDRA-8675: - Comment: was deleted (was: I tried the patch, but still running into the issue where if I look at the data with cqlsh I see a yellow '\n' after the import (literal) instead of purple '\n' (control character) ) > COPY TO/FROM broken for newline characters > -- > > Key: CASSANDRA-8675 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8675 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Tools > Environment: [cqlsh 5.0.1 | Cassandra 2.1.2 | CQL spec 3.2.0 | Native > protocol v3] > Ubuntu 14.04 64-bit >Reporter: Lex Lythius >Priority: Normal > Labels: cqlsh, remove-reopen > Fix For: 3.0.x > > Attachments: CASSANDRA-8675.patch, copytest.csv > > > Exporting/importing does not preserve contents when texts containing newline > (and possibly other) characters are involved: > {code:sql} > cqlsh:test> create table if not exists copytest (id int primary key, t text); > cqlsh:test> insert into copytest (id, t) values (1, 'This has a newline > ... character'); > cqlsh:test> insert into copytest (id, t) values (2, 'This has a quote " > character'); > cqlsh:test> insert into copytest (id, t) values (3, 'This has a fake tab \t > character (typed backslash, t)'); > cqlsh:test> select * from copytest; > id | t > +- > 1 | This has a newline\ncharacter > 2 |This has a quote " character > 3 | This has a fake tab \t character (entered slash-t text) > (3 rows) > cqlsh:test> copy copytest to '/tmp/copytest.csv'; > 3 rows exported in 0.034 seconds. > cqlsh:test> copy copytest from '/tmp/copytest.csv'; > 3 rows imported in 0.005 seconds. > cqlsh:test> select * from copytest; > id | t > +--- > 1 | This has a newlinencharacter > 2 | This has a quote " character > 3 | This has a fake tab \t character (typed backslash, t) > (3 rows) > {code} > I tried replacing \n in the CSV file with \\n, which just expands to \n in > the table; and with an actual newline character, which fails with error since > it prematurely terminates the record. > It seems backslashes are only used to take the following character as a > literal > Until this is fixed, what would be the best way to refactor an old table with > a new, incompatible structure maintaining its content and name, since we > can't rename tables? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15894) JAVA 8: test_multiple_repair - repair_tests.incremental_repair_test.TestIncRepair
[ https://issues.apache.org/jira/browse/CASSANDRA-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ekaterina Dimitrova updated CASSANDRA-15894: Status: In Progress (was: Patch Available) > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > - > > Key: CASSANDRA-15894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15894 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Berenguer Blasi >Priority: Normal > Fix For: 4.0-rc > > > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > Fails locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/46515d14-9be4-4edb-8db4-5930312d2bfb/jobs/1329 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15893) JAVA 11: test_short_read - consistency_test.TestConsistency
[ https://issues.apache.org/jira/browse/CASSANDRA-15893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152367#comment-17152367 ] Ekaterina Dimitrova edited comment on CASSANDRA-15893 at 7/6/20, 11:29 PM: --- Thank you for your time and work [~Bereng]. I know how frustrating this can be, but not being able to reproduce test failures easily happens sometimes. I am already looking into it, and I am moving the ticket back to In Progress. Thank you one more time! was (Author: e.dimitrova): Thank you for your time and work [~Bereng], I know how frustrating this could be but not being able to reproduce test failures easy happens sometimes. I am already looking into it, I am moving the ticket back to work in progress. Thank you one more time! > JAVA 11: test_short_read - consistency_test.TestConsistency > --- > > Key: CASSANDRA-15893 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15893 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-rc > > > JAVA 11: test_short_read - consistency_test.TestConsistency > Failing locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/11202c7e-6c94-4d4e-bbbf-9e2fa9791ad0/jobs/1337 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15893) JAVA 11: test_short_read - consistency_test.TestConsistency
[ https://issues.apache.org/jira/browse/CASSANDRA-15893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152367#comment-17152367 ] Ekaterina Dimitrova edited comment on CASSANDRA-15893 at 7/6/20, 11:26 PM: --- Thank you for your time and work [~Bereng], I know how frustrating this could be but not being able to reproduce test failures easy happens sometimes. I am already looking into it, I am moving the ticket back to work in progress. Thank you one more time! was (Author: e.dimitrova): Thank you for your time and work [~Bereng], I know how frustrating this could be but it happens sometimes. I am already looking into it, I am moving it back to work in progress. Thank you one more time! > JAVA 11: test_short_read - consistency_test.TestConsistency > --- > > Key: CASSANDRA-15893 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15893 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-rc > > > JAVA 11: test_short_read - consistency_test.TestConsistency > Failing locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/11202c7e-6c94-4d4e-bbbf-9e2fa9791ad0/jobs/1337 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15893) JAVA 11: test_short_read - consistency_test.TestConsistency
[ https://issues.apache.org/jira/browse/CASSANDRA-15893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152367#comment-17152367 ] Ekaterina Dimitrova commented on CASSANDRA-15893: - Thank you for your time and work [~Bereng], I know how frustrating this could be but it happens sometimes. I am already looking into it, I am moving it back to work in progress. Thank you one more time! > JAVA 11: test_short_read - consistency_test.TestConsistency > --- > > Key: CASSANDRA-15893 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15893 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-rc > > > JAVA 11: test_short_read - consistency_test.TestConsistency > Failing locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/11202c7e-6c94-4d4e-bbbf-9e2fa9791ad0/jobs/1337 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15893) JAVA 11: test_short_read - consistency_test.TestConsistency
[ https://issues.apache.org/jira/browse/CASSANDRA-15893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ekaterina Dimitrova updated CASSANDRA-15893: Status: In Progress (was: Patch Available) > JAVA 11: test_short_read - consistency_test.TestConsistency > --- > > Key: CASSANDRA-15893 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15893 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-rc > > > JAVA 11: test_short_read - consistency_test.TestConsistency > Failing locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/11202c7e-6c94-4d4e-bbbf-9e2fa9791ad0/jobs/1337 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-8675) COPY TO/FROM broken for newline characters
[ https://issues.apache.org/jira/browse/CASSANDRA-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152366#comment-17152366 ] Jai Bheemsen Rao Dhanwada commented on CASSANDRA-8675: -- I tried the patch, but still running into the issue where if I look at the data with cqlsh I see a yellow '\n' after the import (literal) instead of purple '\n' (control character) > COPY TO/FROM broken for newline characters > -- > > Key: CASSANDRA-8675 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8675 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Tools > Environment: [cqlsh 5.0.1 | Cassandra 2.1.2 | CQL spec 3.2.0 | Native > protocol v3] > Ubuntu 14.04 64-bit >Reporter: Lex Lythius >Priority: Normal > Labels: cqlsh, remove-reopen > Fix For: 3.0.x > > Attachments: CASSANDRA-8675.patch, copytest.csv > > > Exporting/importing does not preserve contents when texts containing newline > (and possibly other) characters are involved: > {code:sql} > cqlsh:test> create table if not exists copytest (id int primary key, t text); > cqlsh:test> insert into copytest (id, t) values (1, 'This has a newline > ... character'); > cqlsh:test> insert into copytest (id, t) values (2, 'This has a quote " > character'); > cqlsh:test> insert into copytest (id, t) values (3, 'This has a fake tab \t > character (typed backslash, t)'); > cqlsh:test> select * from copytest; > id | t > +- > 1 | This has a newline\ncharacter > 2 |This has a quote " character > 3 | This has a fake tab \t character (entered slash-t text) > (3 rows) > cqlsh:test> copy copytest to '/tmp/copytest.csv'; > 3 rows exported in 0.034 seconds. > cqlsh:test> copy copytest from '/tmp/copytest.csv'; > 3 rows imported in 0.005 seconds. > cqlsh:test> select * from copytest; > id | t > +--- > 1 | This has a newlinencharacter > 2 | This has a quote " character > 3 | This has a fake tab \t character (typed backslash, t) > (3 rows) > {code} > I tried replacing \n in the CSV file with \\n, which just expands to \n in > the table; and with an actual newline character, which fails with error since > it prematurely terminates the record. > It seems backslashes are only used to take the following character as a > literal > Until this is fixed, what would be the best way to refactor an old table with > a new, incompatible structure maintaining its content and name, since we > can't rename tables? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Description: h4. Problem The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} for the current offset in the region. Allocations depend on a {{.compareAndSet(..)}} call. In highly contended environments the CAS failures can be high, starving writes in a running Cassandra node. h4. Example Up to 33% of CPU time has been witnessed stuck in the {{NativeAllocator.Region.allocate(..)}} loop (due to the CAS failures) during a heavy spark analytics write load. These nodes (40 CPU cores, 256GB RAM) have the relevant settings - {{memtable_allocation_type: offheap_objects}} - {{memtable_offheap_space_in_mb: 5120}} - {{concurrent_writes: 160}} Numerous flamegraphs demonstrate the problem. See attached [^profile_pbdpc23zafsrh_20200702.svg]. h4. Suggestion: ThreadLocal Regions One possible solution is to have separate Regions per thread. Code-wise this is relatively easy to do, for example replacing NativeAllocator:59 {code}private final AtomicReference<Region> currentRegion = new AtomicReference<>();{code} with {code}private final ThreadLocal<AtomicReference<Region>> currentRegion = new ThreadLocal<>() {...};{code} But this approach substantially changes the allocation behaviour, with more than concurrent_writes number of Regions in use at any one time. For example with {{concurrent_writes: 160}} that's 160+ regions, each of 1MB. h4. Suggestion: Simple Contention Management Algorithm (Constant Backoff) Another possible solution is to introduce a contention management algorithm that a) reduces CAS failures in high contention environments, b) doesn't impact normal environments, and c) keeps the allocation strategy of using one region at a time. The research paper [arXiv:1305.5800|https://arxiv.org/abs/1305.5800] describes this contention CAS problem and demonstrates a number of algorithms to apply. The simplest of these algorithms is the Constant Backoff CAS Algorithm. Applying the Constant Backoff CAS Algorithm involves adding one line of code to {{NativeAllocator.Region.allocate(..)}} to sleep for one nanosecond (or some constant number of nanoseconds) after a CAS failure occurs. That is... {code} // we raced and lost alloc, try again LockSupport.parkNanos(1); {code} h4. Constant Backoff CAS Algorithm Experiments Using the code attached in NativeAllocatorRegionTest.java the concurrency and CAS failures of {{NativeAllocator.Region.allocate(..)}} can be demonstrated. In the attached [^NativeAllocatorRegionTest.java] class, which can be run standalone, the {{Region}} class is copied from {{NativeAllocator.Region}} with a {{casFailures}} field added. The following two screenshots are from data collected from this class on a 6 CPU (12 core) MBP, running the {{NativeAllocatorRegionTest.testRegionCAS}} method. This attached screenshot shows the number of CAS failures during the life of a Region (over ~215 million allocations), using different threads and park times. This illustrates the improvement (reduction) of CAS failures from zero park time, through orders of magnitude, up to 1000ns (10ms). The biggest improvement is from no algorithm to a park time of 1ns where CAS failures are ~two orders of magnitude lower. From a park time of 10μs and higher there is a significant drop also at low contention rates. !Screen Shot 2020-07-05 at 13.16.10.png|width=500px! 
This attached screenshot shows the time it takes to fill a Region (~215 million allocations), using different threads and park times. The biggest improvement is from no algorithm to a park time of 1ns, where performance is one order of magnitude faster. From a park time of 100μs and higher there is an even further significant drop, especially at low contention rates. !Screen Shot 2020-07-05 at 13.26.17.png|width=500px! Repeating the test run shows reliably similar results: [^Screen Shot 2020-07-05 at 13.37.01.png] and [^Screen Shot 2020-07-05 at 13.35.55.png]. h4. Region Per Thread Experiments Implementing Region-per-thread (see the {{NativeAllocatorRegionTest.testRegionThreadLocal}} method), we can expect zero CAS failures over the life of a Region. For performance we see two orders of magnitude lower times to fill up the Region (~420ms). !Screen Shot 2020-07-05 at 13.48.16.png|width=200px! h4. Costs Region-per-thread is an unrealistic solution as it introduces many new problems, from increased memory use to memory leaks and GC issues. It is better tackled as part of a TPC implementation. The backoff approach is simple and elegant, and seems to improve throughput in all situations. It does introduce context switches, which may impact throughput in some busy scenarios, so this should be tested further. was: h4. Problem The method {{NativeAllocat
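Putting the pieces of the description above together, this is a simplified sketch of the allocation loop with the Constant Backoff CAS Algorithm applied. Names loosely follow {{NativeAllocator.Region}}; the off-heap peer pointer and the region-swap logic are elided, and the {{casFailures}} counter mirrors the field added in the attached test class:
{code:java}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

final class Region {
    private final AtomicInteger nextFreeOffset = new AtomicInteger(0);
    private final int capacity;
    final AtomicInteger casFailures = new AtomicInteger(0); // as in the attached test class

    Region(int capacity) { this.capacity = capacity; }

    /** @return the offset of the new allocation, or -1 if the region is full. */
    int allocate(int size) {
        while (true) {
            int oldOffset = nextFreeOffset.get();
            if (oldOffset + size > capacity)
                return -1; // region is full; the caller must swap in a new Region
            if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
                return oldOffset;
            casFailures.incrementAndGet();
            // we raced and lost alloc, try again (the one added line: constant backoff)
            LockSupport.parkNanos(1);
        }
    }
}
{code}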
[jira] [Comment Edited] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152252#comment-17152252 ] Michael Semb Wever edited comment on CASSANDRA-15234 at 7/6/20, 8:30 PM: - {quote}I am fine to be proved wrong in a justified way. Benedict Elliott Smith, Benjamin Lerer, Michael Semb Wever, do you agree with me on my suggestion (reorganizing the yaml file and doing the nested parameters approach later)? {quote} Let's keep listening to what everyone has to say. Some of us are better with the written word than others; it is a second language for some, and for me, as a native English speaker, it is still all too easy to miss things the first time they are said. On that, I believe everyone hears and recognises what [~e.dimitrova] is saying here regarding frustrations about such a substantial change being suggested so late in the game and the amount of time that's been asked to re-invest, especially when an almost identical user-experience improvement was presented two months ago. But it should be said again. On a side-note, it would have really helped me a lot if the comment [above|https://issues.apache.org/jira/browse/CASSANDRA-15234?focusedCommentId=17150521&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17150521] back-referenced [this discussion|https://github.com/apache/cassandra/pull/659#discussion_r449201020] where it originated. I know the ticket was referenced, but that discussion thread is the source of the suggestion. {quote}This ticket’s API replaces the current API, is mutually exclusive with the alternative proposal, and would be deprecated by it. If we introduce them both in 4.0-beta, we must maintain them both and go through the full deprecation process. So unfortunately no churn is avoided. {quote} AFAIK this is the only "grounded" justification for the veto. I don't agree that we are forced into that premise. We can get around those compatibility rules with a minimal amount of effort, by not deprecating the old API and not announcing (in docs or yaml) the new API. (I would expect everyone to intuitively treat private undocumented and un-referenced APIs, only ever available in alpha and beta releases, as unsupported.) All that "compatibility change" can be left and just done once in the separate ticket. The underlying framework and bulk of this patch can still be merged. Based on that I see three possible courses of action: 1. Accept investigating the alternative proposal, and include it in this ticket, delaying our first 4.0-beta release, 2. As (1) but requesting this ticket to be merged during 4.0-beta, so we can release 4.0-beta now, 3. Spin out the new suggestion and all public API changes to a separate ticket, slated for 4.0-beta, and merge this ticket. I suspect, since you have offered to help [~benedict], that most are in favour of (2)? was (Author: michaelsembwever): {quote}I am fine to be proved wrong in a justified way. Benedict Elliott Smith, Benjamin Lerer, Michael Semb Wever, do you agree with me on my suggestion (reorganizing the yaml file and doing the nested parameters approach later)? {quote} Let's keep listening to what everyone has to say. Some of us are better with the written word than others; it is a second language for some, and for me, as a native English speaker, it is still all too easy to miss things the first time they are said.
On that, I believe everyone hears and recognises what [~e.dimitrova] is saying here regarding frustrations about such a substantial change being suggested so late in the game and the amount of time that's been asked to re-invest, especially when an almost identical user-experience improvement was presented two months ago. But it should be said again. On a side-note, it would have really helped me a lot if the comment [above|https://issues.apache.org/jira/browse/CASSANDRA-15234?focusedCommentId=17150521&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17150521] back-referenced [this discussion|https://github.com/apache/cassandra/pull/659#discussion_r449201020] where it originated. I know the ticket was referenced, but that discussion thread is the source of the suggestion. {quote}This ticket’s API replaces the current API, is mutually exclusive with the alternative proposal, and would be deprecated by it. If we introduce them both in 4.0-beta, we must maintain them both and go through the full deprecation process. So unfortunately no churn is avoided. {quote} AFAIK this is the only "grounded" justification for the veto. I don't agree that we are forced into that premise. We can get around those compatibility rules with a minimal amount of effort, by not deprecating the old API and not announcing (in docs or yaml) the new API. (I would expect everyone to intuitive
[jira] [Commented] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152266#comment-17152266 ] David Capwell commented on CASSANDRA-15234: --- I have reread the conversations from the past 3 days several times; sorry if I missed anything or didn't grasp all the points. Most of the thread is about doing an all-or-nothing approach; thanks [~mck] for trying to argue for incremental improvement. Looking at the list of properties impacted (see https://github.com/apache/cassandra/compare/trunk...ekaterinadimitrova2:CASSANDRA-15234-new#diff-4302f2407249672d7845cd58027ff6e9R257-R339) it looks like a subset would be clearly impacted by the grouping approach, and others not so much or are complementary; given this we could accept a handful of the properties and move the others into the grouping work (renames such as read_request_timeout_in_ms to read_request_timeout feel fine even with the grouping approach, but renames that restructure names, such as enable_user_defined_functions to user_defined_functions_enabled, could be left out for now). I do agree with [~benedict] that it isn't ok to keep changing our config API since this is user facing; we should be strict about user-facing changes and try to help more than harm. If there is a belief that one structure is better than another then I value this dialog and hope we can get more eyes from the users/operators to see their thoughts; for this work we should really speak about the YAML representation rather than the code so we can agree on the final result. Also, given the framework that is provided by this patch, I don't see that work as throwing everything away; instead I see it benefiting from the work already started. Given the work involved is to add support for "moving" a field (the current "rename" is a special case of move where the move is at the same level) from one location to another (rename and conversion are already supported), this adds complexity for the case where the new and the old field are both used, and may hit issues with the SnakeYaml implementation. I do believe we should have this discussion and settle on a solution before releasing 4.0.0, but I do not feel that this discussion blocks a beta release. There is a lot of chatter about this being a beta blocker, but I don't really follow why this JIRA (or the grouping one) is a blocker. Reading https://cwiki.apache.org/confluence/display/CASSANDRA/Release+Lifecycle I don't see why this JIRA could not be done during beta; it meets every requirement for this phase. Given all the comments above, my TL;DR: * Can we find a subset of properties with the current patch which are not discarded by the grouping work (sample given above)? * Can we start the conversation and start asking operators of Cassandra clusters for their thoughts on grouping vs not grouping? Grouping could be nice for humans but could be horrid for some automation (I am neither pro nor against grouping; I defer to operators' preference here).
* Can we mark this ticket and the grouping one as non-blocking for beta? > Standardise config and JVM parameters > - > > Key: CASSANDRA-15234 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15234 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Benedict Elliott Smith >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-alpha > > Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt > > > We have a bunch of inconsistent names and config patterns in the codebase, > both from the yamls and JVM properties. It would be nice to standardise the > naming (such as otc_ vs internode_) as well as the provision of values with > units - while maintaining perpetual backwards compatibility with the old > parameter names, of course. > For temporal units, I would propose parsing strings with suffixes of: > {{code}} > u|micros(econds?)? > ms|millis(econds?)? > s(econds?)? > m(inutes?)? > h(ours?)? > d(ays?)? > mo(nths?)? > {{code}} > For rate units, I would propose parsing any of the standard {{B/s, KiB/s, > MiB/s, GiB/s, TiB/s}}. > Perhaps for avoiding ambiguity we could not accept bauds {{bs, Mbps}} or > powers of 1000 such as {{KB/s}}, given these are regularly used for either > their old or new definition e.g. {{KiB/s}}, or we could support them and > simply log the value in bytes/s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
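As a concrete illustration of the "moving/renaming a field" support discussed above, here is a hypothetical sketch (names invented; the actual patch differs) of an annotation-driven rename applied to the raw yaml map before SnakeYaml binds it onto {{Config}}:
{code:java}
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

public class RenameDemo {
    // Hypothetical annotation; the patch's actual mechanism differs in detail.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Replaces {
        String oldName();
    }

    public static class Config {
        @Replaces(oldName = "read_request_timeout_in_ms")
        public String read_request_timeout;
    }

    // Before binding, move any old key still present in the raw yaml map onto the new name.
    static void applyRenames(Map<String, Object> rawYaml, Class<?> configClass) {
        for (Field f : configClass.getFields()) {
            Replaces r = f.getAnnotation(Replaces.class);
            if (r != null && rawYaml.containsKey(r.oldName()) && !rawYaml.containsKey(f.getName()))
                rawYaml.put(f.getName(), rawYaml.remove(r.oldName())); // unit-conversion hook elided
        }
    }

    public static void main(String[] args) {
        Map<String, Object> raw = new HashMap<>();
        raw.put("read_request_timeout_in_ms", 5000);
        applyRenames(raw, Config.class);
        System.out.println(raw); // {read_request_timeout=5000}
    }
}
{code}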
[jira] [Comment Edited] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152252#comment-17152252 ] Michael Semb Wever edited comment on CASSANDRA-15234 at 7/6/20, 7:09 PM: - {quote}I am fine to be proved wrong in a justified way. Benedict Elliott Smith, Benjamin Lerer, Michael Semb Wever, do you agree with me on my suggestion (reorganizing the yaml file and doing the nested parameters approach later)? {quote} Let's keep listening to what everyone has to say. Some of us are better with the written word than others; it is a second language for some, and for me, as a native English speaker, it is still all too easy to miss things the first time they are said. On that, I believe everyone hears and recognises what [~e.dimitrova] is saying here regarding frustrations about such a substantial change being suggested so late in the game and the amount of time that's been asked to re-invest, especially when an almost identical user-experience improvement was presented two months ago. But it should be said again. On a side-note, it would have really helped me a lot if the comment [above|https://issues.apache.org/jira/browse/CASSANDRA-15234?focusedCommentId=17150521&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17150521] back-referenced [this discussion|https://github.com/apache/cassandra/pull/659#discussion_r449201020] where it originated. I know the ticket was referenced, but that discussion thread is the source of the suggestion. {quote}This ticket’s API replaces the current API, is mutually exclusive with the alternative proposal, and would be deprecated by it. If we introduce them both in 4.0-beta, we must maintain them both and go through the full deprecation process. So unfortunately no churn is avoided. {quote} AFAIK this is the only "grounded" justification for the veto. I don't agree that we are forced into that premise. We can get around those compatibility rules with a minimal amount of effort, by not deprecating the old API and not announcing (in docs or yaml) the new API. (I would expect everyone to intuitively treat private undocumented and un-referenced APIs, only ever available in alpha and beta releases, as unsupported.) All that "compatibility change" can be left and just done once in the separate ticket. The underlying framework and bulk of this patch can still be merged. Based on that I see three possible courses of action: 1. Accept investigating the alternative proposal, and include it in this ticket, delaying our first 4.0-beta release, 2. As (1) but requesting this ticket to be merged during 4.0-beta, so we can release 4.0-beta now, 3. Spin out the new suggestion and all public API changes to a separate ticket, slated for 4.0-beta, and merge this ticket. I suspect, since you have offered to help [~benedict], that most are in favour of (2)? was (Author: michaelsembwever): {quote}I am fine to be proved wrong in a justified way. Benedict Elliott Smith, Benjamin Lerer, Michael Semb Wever, do you agree with me on my suggestion (reorganizing the yaml file and doing the nested parameters approach later)? {quote} Let's keep listening to what everyone has to say. Some of us are better with the written word than others; it is a second language for some, and for me, as a native English speaker, it is still all too easy to miss things the first time they are said.
On that, I believe everyone hears and recognises what [~e.dimitrova] is saying here regarding frustrations about such a substantial change being suggested so late in the game and the amount of time that's been asked to re-invest, especially when an almost identical user-experience improvement was presented two months ago. But it should be said again. On a side-note, it would have really helped me a lot if the comment above back-referenced [this discussion|https://github.com/apache/cassandra/pull/659#discussion_r449201020] where it originated. I know the ticket was referenced, but that discussion thread is the source of the suggestion. {quote}This ticket’s API replaces the current API, is mutually exclusive with the alternative proposal, and would be deprecated by it. If we introduce them both in 4.0-beta, we must maintain them both and go through the full deprecation process. So unfortunately no churn is avoided. {quote} AFAIK this is the only "grounded" justification for the veto. I don't agree that we are forced into that premise. We can get around those compatibility rules with a minimal amount of effort, by not deprecating the old API and not announcing (in docs or yaml) the new API. (I would expect everyone to intuitively treat private undocumented and un-referenced APIs, only ever available in alpha and beta releases, as unsupported.) All that "compatibil
[jira] [Commented] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152252#comment-17152252 ] Michael Semb Wever commented on CASSANDRA-15234: {quote}I am fine to be proved wrong in a justified way. Benedict Elliott Smith, Benjamin Lerer, Michael Semb Wever, do you agree with me on my suggestion (reorganizing the yaml file and doing the nested parameters approach later)? {quote} Let's keep listening to what everyone has to say. Some of us are better with the written word than others; it is a second language for some, and for me, as a native English speaker, it is still all too easy to miss things the first time they are said. On that, I believe everyone hears and recognises what [~e.dimitrova] is saying here regarding frustrations about such a substantial change being suggested so late in the game and the amount of time that's been asked to re-invest, especially when an almost identical user-experience improvement was presented two months ago. But it should be said again. On a side-note, it would have really helped me a lot if the comment above back-referenced [this discussion|https://github.com/apache/cassandra/pull/659#discussion_r449201020] where it originated. I know the ticket was referenced, but that discussion thread is the source of the suggestion. {quote}This ticket’s API replaces the current API, is mutually exclusive with the alternative proposal, and would be deprecated by it. If we introduce them both in 4.0-beta, we must maintain them both and go through the full deprecation process. So unfortunately no churn is avoided. {quote} AFAIK this is the only "grounded" justification for the veto. I don't agree that we are forced into that premise. We can get around those compatibility rules with a minimal amount of effort, by not deprecating the old API and not announcing (in docs or yaml) the new API. (I would expect everyone to intuitively treat private undocumented and un-referenced APIs, only ever available in alpha and beta releases, as unsupported.) All that "compatibility change" can be left and just done once in the separate ticket. The underlying framework and bulk of this patch can still be merged. Based on that I see three possible courses of action: 1. Accept investigating the alternative proposal, and include it in this ticket, delaying our first 4.0-beta release, 2. As (1) but requesting this ticket to be merged during 4.0-beta, so we can release 4.0-beta now, 3. Spin out the new suggestion and all public API changes to a separate ticket, slated for 4.0-beta, and merge this ticket. I suspect, since you have offered to help [~benedict], that most are in favour of (2)? > Standardise config and JVM parameters > - > > Key: CASSANDRA-15234 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15234 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Benedict Elliott Smith >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-alpha > > Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt > > > We have a bunch of inconsistent names and config patterns in the codebase, > both from the yamls and JVM properties. It would be nice to standardise the > naming (such as otc_ vs internode_) as well as the provision of values with > units - while maintaining perpetual backwards compatibility with the old > parameter names, of course. > For temporal units, I would propose parsing strings with suffixes of: > {{code}} > u|micros(econds?)? > ms|millis(econds?)?
> s(econds?)? > m(inutes?)? > h(ours?)? > d(ays?)? > mo(nths?)? > {{code}} > For rate units, I would propose parsing any of the standard {{B/s, KiB/s, > MiB/s, GiB/s, TiB/s}}. > Perhaps for avoiding ambiguity we could not accept bauds {{bs, Mbps}} or > powers of 1000 such as {{KB/s}}, given these are regularly used for either > their old or new definition e.g. {{KiB/s}}, or we could support them and > simply log the value in bytes/s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15907) Operational Improvements & Hardening for Replica Filtering Protection
[ https://issues.apache.org/jira/browse/CASSANDRA-15907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Caleb Rackliffe updated CASSANDRA-15907: Reviewers: Andres de la Peña, Jordan West, Caleb Rackliffe (was: Andres de la Peña, Caleb Rackliffe, Jordan West) Andres de la Peña, Jordan West, Caleb Rackliffe (was: Andres de la Peña, Jordan West) Status: Review In Progress (was: Patch Available) > Operational Improvements & Hardening for Replica Filtering Protection > - > > Key: CASSANDRA-15907 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15907 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination, Feature/2i Index >Reporter: Caleb Rackliffe >Assignee: Caleb Rackliffe >Priority: Normal > Labels: 2i, memory > Fix For: 3.0.x, 3.11.x, 4.0-beta > > > CASSANDRA-8272 uses additional space on the heap to ensure correctness for 2i > and filtering queries at consistency levels above ONE/LOCAL_ONE. There are a > few things we should follow up on, however, to make life a bit easier for > operators and generally de-risk usage: > (Note: Line numbers are based on {{trunk}} as of > {{3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb}}.) > *Minor Optimizations* > * {{ReplicaFilteringProtection:114}} - Given we size them up-front, we may be > able to use simple arrays instead of lists for {{rowsToFetch}} and > {{originalPartitions}}. Alternatively (or also), we may be able to null out > references in these two collections more aggressively. (ex. Using > {{ArrayList#set()}} instead of {{get()}} in {{queryProtectedPartitions()}}, > assuming we pass {{toFetch}} as an argument to {{querySourceOnKey()}}.) > * {{ReplicaFilteringProtection:323}} - We may be able to use > {{EncodingStats.merge()}} and remove the custom {{stats()}} method. > * {{DataResolver:111 & 228}} - Cache an instance of > {{UnaryOperator#identity()}} instead of creating one on the fly. > * {{ReplicaFilteringProtection:217}} - We may be able to scatter/gather > rather than serially querying every row that needs to be completed. This > isn't a clear win perhaps, given it targets the latency of single queries and > adds some complexity. (Certainly a decent candidate to kick even out of this > issue.) > *Documentation and Intelligibility* > * There are a few places (CHANGES.txt, tracing output in > {{ReplicaFilteringProtection}}, etc.) where we mention "replica-side > filtering protection" (which makes it seem like the coordinator doesn't > filter) rather than "replica filtering protection" (which sounds more like > what we actually do, which is protect ourselves against incorrect replica > filtering results). It's a minor fix, but would avoid confusion. > * The method call chain in {{DataResolver}} might be a bit simpler if we put > the {{repairedDataTracker}} in {{ResolveContext}}. > *Testing* > * I want to bite the bullet and get some basic tests for RFP (including any > guardrails we might add here) onto the in-JVM dtest framework. > *Guardrails* > * As it stands, we don't have a way to enforce an upper bound on the memory > usage of {{ReplicaFilteringProtection}} which caches row responses from the > first round of requests. (Remember, these are later merged with the > second round of results to complete the data for filtering.) Operators will > likely need a way to protect themselves, i.e. simply fail queries if they hit > a particular threshold rather than GC nodes into oblivion.
(Having control > over limits and page sizes doesn't quite get us there, because stale results > _expand_ the number of incomplete results we must cache.) The fun question is > how we do this, with the primary axes being scope (per-query, global, etc.) > and granularity (per-partition, per-row, per-cell, actual heap usage, etc.). > My starting disposition on the right trade-off between > performance/complexity and accuracy is having something along the lines of > cached rows per query. Prior art suggests this probably makes sense alongside > things like {{tombstone_failure_threshold}} in {{cassandra.yaml}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
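One hypothetical shape for the per-query cached-rows guardrail described above (names and thresholds invented; in practice they would come from {{cassandra.yaml}}), warning at one threshold and failing the query at a higher one:
{code:java}
// Hypothetical sketch, not the committed implementation.
final class CachedRowsGuard {
    private final int warnThreshold;
    private final int failThreshold;
    private int cachedRows;
    private boolean warned;

    CachedRowsGuard(int warnThreshold, int failThreshold) {
        this.warnThreshold = warnThreshold;
        this.failThreshold = failThreshold;
    }

    /** Called each time replica filtering protection caches another row response. */
    void onRowCached() {
        cachedRows++;
        if (cachedRows > failThreshold)
            throw new IllegalStateException("replica filtering protection cached " + cachedRows + " rows; failing query");
        if (cachedRows > warnThreshold && !warned) {
            warned = true; // warn only once per query
            System.err.println("WARN: replica filtering protection has cached " + cachedRows + " rows");
        }
    }
}
{code}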
[jira] [Comment Edited] (CASSANDRA-15907) Operational Improvements & Hardening for Replica Filtering Protection
[ https://issues.apache.org/jira/browse/CASSANDRA-15907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152248#comment-17152248 ] Caleb Rackliffe edited comment on CASSANDRA-15907 at 7/6/20, 6:58 PM: -- [~jwest] I've hopefully addressed the points from [~adelapena]'s first round of review, so I think this is officially ready for a second reviewer. 3.0: [patch|https://github.com/apache/cassandra/pull/659], [CircleCI|https://app.circleci.com/pipelines/github/maedhroz/cassandra/22/workflows/d272c9e6-1db6-472f-93d9-f2715a25ef97] If we're happy with the implementation, the next step will be to do some basic stress testing. was (Author: maedhroz): [~jwest] I've hopefully addressed the points from [~adelapena]'s first round of review, so I think this is officially ready for a second reviewer. 3.0: [patch|https://github.com/apache/cassandra/pull/659], [CircleCI|https://app.circleci.com/pipelines/github/maedhroz/cassandra/22/workflows/d272c9e6-1db6-472f-93d9-f2715a25ef97] > Operational Improvements & Hardening for Replica Filtering Protection > - > > Key: CASSANDRA-15907 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15907 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination, Feature/2i Index >Reporter: Caleb Rackliffe >Assignee: Caleb Rackliffe >Priority: Normal > Labels: 2i, memory > Fix For: 3.0.x, 3.11.x, 4.0-beta > > > CASSANDRA-8272 uses additional space on the heap to ensure correctness for 2i > and filtering queries at consistency levels above ONE/LOCAL_ONE. There are a > few things we should follow up on, however, to make life a bit easier for > operators and generally de-risk usage: > (Note: Line numbers are based on {{trunk}} as of > {{3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb}}.) > *Minor Optimizations* > * {{ReplicaFilteringProtection:114}} - Given we size them up-front, we may be > able to use simple arrays instead of lists for {{rowsToFetch}} and > {{originalPartitions}}. Alternatively (or also), we may be able to null out > references in these two collections more aggressively. (ex. Using > {{ArrayList#set()}} instead of {{get()}} in {{queryProtectedPartitions()}}, > assuming we pass {{toFetch}} as an argument to {{querySourceOnKey()}}.) > * {{ReplicaFilteringProtection:323}} - We may be able to use > {{EncodingStats.merge()}} and remove the custom {{stats()}} method. > * {{DataResolver:111 & 228}} - Cache an instance of > {{UnaryOperator#identity()}} instead of creating one on the fly. > * {{ReplicaFilteringProtection:217}} - We may be able to scatter/gather > rather than serially querying every row that needs to be completed. This > isn't a clear win perhaps, given it targets the latency of single queries and > adds some complexity. (Certainly a decent candidate to kick even out of this > issue.) > *Documentation and Intelligibility* > * There are a few places (CHANGES.txt, tracing output in > {{ReplicaFilteringProtection}}, etc.) where we mention "replica-side > filtering protection" (which makes it seem like the coordinator doesn't > filter) rather than "replica filtering protection" (which sounds more like > what we actually do, which is protect ourselves against incorrect replica > filtering results). It's a minor fix, but would avoid confusion. > * The method call chain in {{DataResolver}} might be a bit simpler if we put > the {{repairedDataTracker}} in {{ResolveContext}}. 
> *Testing* > * I want to bite the bullet and get some basic tests for RFP (including any > guardrails we might add here) onto the in-JVM dtest framework. > *Guardrails* > * As it stands, we don't have a way to enforce an upper bound on the memory > usage of {{ReplicaFilteringProtection}} which caches row responses from the > first round of requests. (Remember, these are later merged with the > second round of results to complete the data for filtering.) Operators will > likely need a way to protect themselves, i.e. simply fail queries if they hit > a particular threshold rather than GC nodes into oblivion. (Having control > over limits and page sizes doesn't quite get us there, because stale results > _expand_ the number of incomplete results we must cache.) The fun question is > how we do this, with the primary axes being scope (per-query, global, etc.) > and granularity (per-partition, per-row, per-cell, actual heap usage, etc.). > My starting disposition on the right trade-off between > performance/complexity and accuracy is having something along the lines of > cached rows per query. Prior art suggests this probably makes sense alongside > things like {{tombstone_failure_threshold}} in {{cassandra.yaml}}. -- This me
[jira] [Updated] (CASSANDRA-15907) Operational Improvements & Hardening for Replica Filtering Protection
[ https://issues.apache.org/jira/browse/CASSANDRA-15907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Caleb Rackliffe updated CASSANDRA-15907: Test and Documentation Plan: The first line of defense against regression here is the set of dtests built for CASSANDRA-8272 in {{replica_side_filtering}}. In addition to that, we'll need at minimum a basic battery of in-JVM dtests around the new guardrails. Once the implementation is reviewed, we'll use the {{tlp-stress}} filtering workload to stress things a bit, both to see how things behave with larger sets of query results when filtering protection isn't activated, and to see how the thresholds work when we have severely out-of-sync replicas. was: The first line of defense against regression here is the set of dtests built for CASSANDRA-8272 in {{replica_side_filtering}}. In addition to that, we'll need at minimum a basic battery of in-JVM dtests around the new guardrails. Once the implementation is reviewed, we'll use the {{tlp-stress}} filtering workload to stress things a bit, both to see how things behave with larger sets of query results when filtering protection isn't activated, and to see how the thresholds work when we have severely out-of-sync replicas. > Operational Improvements & Hardening for Replica Filtering Protection > - > > Key: CASSANDRA-15907 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15907 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination, Feature/2i Index >Reporter: Caleb Rackliffe >Assignee: Caleb Rackliffe >Priority: Normal > Labels: 2i, memory > Fix For: 3.0.x, 3.11.x, 4.0-beta > > > CASSANDRA-8272 uses additional space on the heap to ensure correctness for 2i > and filtering queries at consistency levels above ONE/LOCAL_ONE. There are a > few things we should follow up on, however, to make life a bit easier for > operators and generally de-risk usage: > (Note: Line numbers are based on {{trunk}} as of > {{3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb}}.) > *Minor Optimizations* > * {{ReplicaFilteringProtection:114}} - Given we size them up-front, we may be > able to use simple arrays instead of lists for {{rowsToFetch}} and > {{originalPartitions}}. Alternatively (or also), we may be able to null out > references in these two collections more aggressively. (ex. Using > {{ArrayList#set()}} instead of {{get()}} in {{queryProtectedPartitions()}}, > assuming we pass {{toFetch}} as an argument to {{querySourceOnKey()}}.) > * {{ReplicaFilteringProtection:323}} - We may be able to use > {{EncodingStats.merge()}} and remove the custom {{stats()}} method. > * {{DataResolver:111 & 228}} - Cache an instance of > {{UnaryOperator#identity()}} instead of creating one on the fly. > * {{ReplicaFilteringProtection:217}} - We may be able to scatter/gather > rather than serially querying every row that needs to be completed. This > isn't a clear win perhaps, given it targets the latency of single queries and > adds some complexity. (Certainly a decent candidate to kick even out of this > issue.) > *Documentation and Intelligibility* > * There are a few places (CHANGES.txt, tracing output in > {{ReplicaFilteringProtection}}, etc.) where we mention "replica-side > filtering protection" (which makes it seem like the coordinator doesn't > filter) rather than "replica filtering protection" (which sounds more like > what we actually do, which is protect ourselves against incorrect replica > filtering results). It's a minor fix, but would avoid confusion.
> * The method call chain in {{DataResolver}} might be a bit simpler if we put > the {{repairedDataTracker}} in {{ResolveContext}}. > *Testing* > * I want to bite the bullet and get some basic tests for RFP (including any > guardrails we might add here) onto the in-JVM dtest framework. > *Guardrails* > * As it stands, we don't have a way to enforce an upper bound on the memory > usage of {{ReplicaFilteringProtection}} which caches row responses from the > first round of requests. (Remember, these are later merged with > the second round of results to complete the data for filtering.) Operators will > likely need a way to protect themselves, i.e. simply fail queries if they hit > a particular threshold rather than GC nodes into oblivion. (Having control > over limits and page sizes doesn't quite get us there, because stale results > _expand_ the number of incomplete results we must cache.) The fun question is > how we do this, with the primary axes being scope (per-query, global, etc.) > and granularity (per-partition, per-row, per-cell, actual heap usage, etc.). > My starting disposition on the right trade-off between > performance/complexit
[jira] [Commented] (CASSANDRA-15907) Operational Improvements & Hardening for Replica Filtering Protection
[ https://issues.apache.org/jira/browse/CASSANDRA-15907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152248#comment-17152248 ] Caleb Rackliffe commented on CASSANDRA-15907: - [~jwest] I've hopefully addressed the points from [~adelapena]'s first round of review, so I think this is officially ready for a second reviewer. 3.0: [patch|https://github.com/apache/cassandra/pull/659], [CircleCI|https://app.circleci.com/pipelines/github/maedhroz/cassandra/22/workflows/d272c9e6-1db6-472f-93d9-f2715a25ef97] > Operational Improvements & Hardening for Replica Filtering Protection > - > > Key: CASSANDRA-15907 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15907 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination, Feature/2i Index >Reporter: Caleb Rackliffe >Assignee: Caleb Rackliffe >Priority: Normal > Labels: 2i, memory > Fix For: 3.0.x, 3.11.x, 4.0-beta > > > CASSANDRA-8272 uses additional space on the heap to ensure correctness for 2i > and filtering queries at consistency levels above ONE/LOCAL_ONE. There are a > few things we should follow up on, however, to make life a bit easier for > operators and generally de-risk usage: > (Note: Line numbers are based on {{trunk}} as of > {{3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb}}.) > *Minor Optimizations* > * {{ReplicaFilteringProtection:114}} - Given we size them up-front, we may be > able to use simple arrays instead of lists for {{rowsToFetch}} and > {{originalPartitions}}. Alternatively (or also), we may be able to null out > references in these two collections more aggressively. (ex. Using > {{ArrayList#set()}} instead of {{get()}} in {{queryProtectedPartitions()}}, > assuming we pass {{toFetch}} as an argument to {{querySourceOnKey()}}.) > * {{ReplicaFilteringProtection:323}} - We may be able to use > {{EncodingStats.merge()}} and remove the custom {{stats()}} method. > * {{DataResolver:111 & 228}} - Cache an instance of > {{UnaryOperator#identity()}} instead of creating one on the fly. > * {{ReplicaFilteringProtection:217}} - We may be able to scatter/gather > rather than serially querying every row that needs to be completed. This > isn't a clear win perhaps, given it targets the latency of single queries and > adds some complexity. (Certainly a decent candidate to kick even out of this > issue.) > *Documentation and Intelligibility* > * There are a few places (CHANGES.txt, tracing output in > {{ReplicaFilteringProtection}}, etc.) where we mention "replica-side > filtering protection" (which makes it seem like the coordinator doesn't > filter) rather than "replica filtering protection" (which sounds more like > what we actually do, which is protect ourselves against incorrect replica > filtering results). It's a minor fix, but would avoid confusion. > * The method call chain in {{DataResolver}} might be a bit simpler if we put > the {{repairedDataTracker}} in {{ResolveContext}}. > *Testing* > * I want to bite the bullet and get some basic tests for RFP (including any > guardrails we might add here) onto the in-JVM dtest framework. > *Guardrails* > * As it stands, we don't have a way to enforce an upper bound on the memory > usage of {{ReplicaFilteringProtection}} which caches row responses from the > first round of requests. (Remember, these are later merged with the > second round of results to complete the data for filtering.) Operators will > likely need a way to protect themselves, i.e.
simply fail queries if they hit > a particular threshold rather than GC nodes into oblivion. (Having control > over limits and page sizes doesn't quite get us there, because stale results > _expand_ the number of incomplete results we must cache.) The fun question is > how we do this, with the primary axes being scope (per-query, global, etc.) > and granularity (per-partition, per-row, per-cell, actual heap usage, etc.). > My starting disposition on the right trade-off between > performance/complexity and accuracy is having something along the lines of > cached rows per query. Prior art suggests this probably makes sense alongside > things like {{tombstone_failure_threshold}} in {{cassandra.yaml}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15907) Operational Improvements & Hardening for Replica Filtering Protection
[ https://issues.apache.org/jira/browse/CASSANDRA-15907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Caleb Rackliffe updated CASSANDRA-15907: Test and Documentation Plan: The first line of defense against regression here is the set of dtests built for CASSANDRA-8272 in {{replica_side_filtering}}. In addition to that, we'll need at minimum a basic battery of in-JVM dtests around the new guardrails. Once the implementation is reviewed, we'll use the {{tlp-stress}} filtering workload to stress things a bit, both to see how things behave with larger sets of query results when filtering protection isn't activated, and to see how the thresholds work when we have severely out-of-sync replicas. was: The first line of defense against regression here is the set of dtests built for CASSANDRA-8272 in {{replica_side_filtering}}. In addition to that, we'll need at minimum a basic battery of in-JVM dtests around the new guardrails. Status: Patch Available (was: In Progress) > Operational Improvements & Hardening for Replica Filtering Protection > - > > Key: CASSANDRA-15907 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15907 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination, Feature/2i Index >Reporter: Caleb Rackliffe >Assignee: Caleb Rackliffe >Priority: Normal > Labels: 2i, memory > Fix For: 3.0.x, 3.11.x, 4.0-beta > > > CASSANDRA-8272 uses additional space on the heap to ensure correctness for 2i > and filtering queries at consistency levels above ONE/LOCAL_ONE. There are a > few things we should follow up on, however, to make life a bit easier for > operators and generally de-risk usage: > (Note: Line numbers are based on {{trunk}} as of > {{3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb}}.) > *Minor Optimizations* > * {{ReplicaFilteringProtection:114}} - Given we size them up-front, we may be > able to use simple arrays instead of lists for {{rowsToFetch}} and > {{originalPartitions}}. Alternatively (or also), we may be able to null out > references in these two collections more aggressively. (ex. Using > {{ArrayList#set()}} instead of {{get()}} in {{queryProtectedPartitions()}}, > assuming we pass {{toFetch}} as an argument to {{querySourceOnKey()}}.) > * {{ReplicaFilteringProtection:323}} - We may be able to use > {{EncodingStats.merge()}} and remove the custom {{stats()}} method. > * {{DataResolver:111 & 228}} - Cache an instance of > {{UnaryOperator#identity()}} instead of creating one on the fly. > * {{ReplicaFilteringProtection:217}} - We may be able to scatter/gather > rather than serially querying every row that needs to be completed. This > isn't a clear win perhaps, given it targets the latency of single queries and > adds some complexity. (Certainly a decent candidate to kick even out of this > issue.) > *Documentation and Intelligibility* > * There are a few places (CHANGES.txt, tracing output in > {{ReplicaFilteringProtection}}, etc.) where we mention "replica-side > filtering protection" (which makes it seem like the coordinator doesn't > filter) rather than "replica filtering protection" (which sounds more like > what we actually do, which is protect ourselves against incorrect replica > filtering results). It's a minor fix, but would avoid confusion. > * The method call chain in {{DataResolver}} might be a bit simpler if we put > the {{repairedDataTracker}} in {{ResolveContext}}.
> *Testing* > * I want to bite the bullet and get some basic tests for RFP (including any > guardrails we might add here) onto the in-JVM dtest framework. > *Guardrails* > * As it stands, we don't have a way to enforce an upper bound on the memory > usage of {{ReplicaFilteringProtection}} which caches row responses from the > first round of requests. (Remember, these are later merged with the > second round of results to complete the data for filtering.) Operators will > likely need a way to protect themselves, i.e. simply fail queries if they hit > a particular threshold rather than GC nodes into oblivion. (Having control > over limits and page sizes doesn't quite get us there, because stale results > _expand_ the number of incomplete results we must cache.) The fun question is > how we do this, with the primary axes being scope (per-query, global, etc.) > and granularity (per-partition, per-row, per-cell, actual heap usage, etc.). > My starting disposition on the right trade-off between > performance/complexity and accuracy is having something along the lines of > cached rows per query. Prior art suggests this probably makes sense alongside > things like {{tombstone_failure_threshold}} in {{cassandra.yaml}}. -- This message was sen
[jira] [Comment Edited] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152224#comment-17152224 ] Caleb Rackliffe edited comment on CASSANDRA-15234 at 7/6/20, 6:00 PM: -- My mental model of why grouping _might_ be valuable: * It provides a logical place to describe/comment on entire features in the YAML. * It avoids duplicate/unwieldy prefixing without sacrificing intelligibility/specificity. * It doesn't rely on the presence of comments. My understanding of the changes here is that there are dozens of options that have already been renamed. Assuming we proceed with grouping, supporting three different forms of these options doesn't seem like the outcome we want. There are really only a handful of groupings that would be interesting and obvious. Essentially, hinted handoff, commitlog, memtable, rpc, compaction, and maybe the caches. (Timeouts seem a bit scattered.) What I'm most worried about is the number of versions we have to support at any given time, not whether we change some option grouping early in the beta period. My vote, at this point, would be to just move this issue to beta and hash out a proposal for the (somewhat obvious) option groups I've mentioned above. was (Author: maedhroz): My mental model of why grouping _might_ be valuable: * It provides a logical place to describe/comment on entire features in the YAML. * It avoids duplicate prefixing without sacrificing intelligibility/specificity. * It doesn't rely on the presence of comments. My understanding of the changes here is that there are dozens of options that have already been renamed. Assuming we proceed with grouping, supporting three different forms of these options doesn't seem like the outcome we want. There are really only a handful of groupings that would be interesting and obvious. Essentially, hinted handoff, commitlog, memtable, rpc, compaction, and maybe the caches. (Timeouts seem a bit scattered.) What I'm most worried about is the number of versions we have to support at any given time, not whether we change some option grouping early in the beta period. My vote, at this point, would be to just move this issue to beta and hash out a proposal for the (somewhat obvious) option groups I've mentioned above. > Standardise config and JVM parameters > - > > Key: CASSANDRA-15234 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15234 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Benedict Elliott Smith >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-alpha > > Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt > > > We have a bunch of inconsistent names and config patterns in the codebase, > both from the yamls and JVM properties. It would be nice to standardise the > naming (such as otc_ vs internode_) as well as the provision of values with > units - while maintaining perpetual backwards compatibility with the old > parameter names, of course. > For temporal units, I would propose parsing strings with suffixes of: > {{code}} > u|micros(econds?)? > ms|millis(econds?)? > s(econds?)? > m(inutes?)? > h(ours?)? > d(ays?)? > mo(nths?)? > {{code}} > For rate units, I would propose parsing any of the standard {{B/s, KiB/s, > MiB/s, GiB/s, TiB/s}}. > Perhaps for avoiding ambiguity we could not accept bauds {{bs, Mbps}} or > powers of 1000 such as {{KB/s}}, given these are regularly used for either > their old or new definition e.g.
{{KiB/s}}, or we could support them and > simply log the value in bytes/s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152224#comment-17152224 ] Caleb Rackliffe commented on CASSANDRA-15234: - My mental model of why grouping _might_ be valuable: * It provides a logical place to describe/comment on entire features in the YAML. * It avoids duplicate prefixing without sacrificing intelligibility/specificity. * It doesn't rely on the presence of comments. My understanding of the changes here is that there are dozens of options that have already been renamed. Assuming we proceed with grouping, supporting three different forms of these options doesn't seem like the outcome we want. There are really only a handful of groupings that would be interesting and obvious. Essentially, hinted handoff, commitlog, memtable, rpc, compaction, and maybe the caches. (Timeouts seem a bit scattered.) What I'm most worried about is the number of versions we have to support at any given time, not whether we change some option grouping early in the beta period. My vote, at this point, would be to just move this issue to beta and hash out a proposal for the (somewhat obvious) option groups I've mentioned above. > Standardise config and JVM parameters > - > > Key: CASSANDRA-15234 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15234 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Benedict Elliott Smith >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-alpha > > Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt > > > We have a bunch of inconsistent names and config patterns in the codebase, > both from the yamls and JVM properties. It would be nice to standardise the > naming (such as otc_ vs internode_) as well as the provision of values with > units - while maintaining perpetual backwards compatibility with the old > parameter names, of course. > For temporal units, I would propose parsing strings with suffixes of: > {{code}} > u|micros(econds?)? > ms|millis(econds?)? > s(econds?)? > m(inutes?)? > h(ours?)? > d(ays?)? > mo(nths?)? > {{code}} > For rate units, I would propose parsing any of the standard {{B/s, KiB/s, > MiB/s, GiB/s, TiB/s}}. > Perhaps for avoiding ambiguity we could not accept bauds {{bs, Mbps}} or > powers of 1000 such as {{KB/s}}, given these are regularly used for either > their old or new definition e.g. {{KiB/s}}, or we could support them and > simply log the value in bytes/s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
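For illustration, a minimal sketch of the kind of suffix parsing proposed in the description, covering only a simplified subset of the suffixes (the full grammar and the ambiguity rules above are left out):
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DurationParser {
    private static final Pattern SUFFIXED = Pattern.compile("(\\d+)\\s*(us|ms|s|m|h|d)");

    // Parses e.g. "500ms" or "2m" and returns the value in nanoseconds.
    static long toNanos(String input) {
        Matcher m = SUFFIXED.matcher(input.trim());
        if (!m.matches())
            throw new IllegalArgumentException("unparseable duration: " + input);
        long value = Long.parseLong(m.group(1));
        switch (m.group(2)) {
            case "us": return TimeUnit.MICROSECONDS.toNanos(value);
            case "ms": return TimeUnit.MILLISECONDS.toNanos(value);
            case "s":  return TimeUnit.SECONDS.toNanos(value);
            case "m":  return TimeUnit.MINUTES.toNanos(value);
            case "h":  return TimeUnit.HOURS.toNanos(value);
            default:   return TimeUnit.DAYS.toNanos(value);
        }
    }

    public static void main(String[] args) {
        System.out.println(toNanos("500ms")); // 500000000
        System.out.println(toNanos("2m"));    // 120000000000
    }
}
{code}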
[jira] [Commented] (CASSANDRA-15909) Make Table/Keyspace Metric Names Consistent With Each Other
[ https://issues.apache.org/jira/browse/CASSANDRA-15909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1715#comment-1715 ] David Capwell commented on CASSANDRA-15909: --- bq. The good news is that these two are really new and only present in trunk (and only added in the last few weeks), so we don't need to bother with deprecation. Sounds good to me > Make Table/Keyspace Metric Names Consistent With Each Other > --- > > Key: CASSANDRA-15909 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15909 > Project: Cassandra > Issue Type: Improvement > Components: Observability/Metrics >Reporter: Stephen Mallette >Assignee: Stephen Mallette >Priority: Normal > Fix For: 4.0-beta > > > As part of CASSANDRA-15821 it became apparent that certain metric names found > in keyspace and tables had different names but were in fact the same metric - > they are as follows: > * Table.SyncTime == Keyspace.RepairSyncTime > * Table.RepairedDataTrackingOverreadRows == Keyspace.RepairedOverreadRows > * Table.RepairedDataTrackingOverreadTime == Keyspace.RepairedOverreadTime > * Table.AllMemtablesHeapSize == Keyspace.AllMemtablesOnHeapDataSize > * Table.AllMemtablesOffHeapSize == Keyspace.AllMemtablesOffHeapDataSize > * Table.MemtableOnHeapSize == Keyspace.MemtableOnHeapDataSize > * Table.MemtableOffHeapSize == Keyspace.MemtableOffHeapDataSize > Also, client metrics are the only metrics to start with a lower case letter. > Change those to upper case to match all the other metrics. > Unifying this naming would help make metrics more consistent as part of > CASSANDRA-15582 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
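Where a renamed metric does need a deprecation window, one common pattern with Dropwizard Metrics (illustrative only; not what the patch above does, since these names are trunk-only) is to register the same underlying instance under both names, so neither view of the data is lost:
{code:java}
import java.util.concurrent.TimeUnit;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class MetricAlias {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        Timer syncTime = registry.timer("Table.SyncTime");
        // Register the very same Timer instance under the keyspace-scoped alias.
        registry.register("Keyspace.RepairSyncTime", syncTime);
        syncTime.update(5, TimeUnit.MILLISECONDS);
        // Both names observe the same underlying data.
        System.out.println(registry.timer("Keyspace.RepairSyncTime").getCount()); // 1
    }
}
{code}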
[jira] [Comment Edited] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152159#comment-17152159 ] Ekaterina Dimitrova edited comment on CASSANDRA-15234 at 7/6/20, 5:08 PM: -- Apologies for my late response. I was a bit sick these days and tried to disengage from work and take some rest over the weekend. With all my respect to everyone's opinion and experience on the project, I have two points here: - I truly support [~mck]'s questions. I believe they should be responded to before any decision is taken and someone jumps into actual work. {quote}how many settings does it apply to? is taxonomy based on a technical or user perspective? if user/operator based, how many people need to be involved to get it right? if user/operator based, what if one property applies to multiple concerns? how does the @Replace annotation work between levels in the grouping? does this introduce more complexity/variations in what has to be tested? (since yaml can consist of old and new setting names) {quote} - I was also wondering today while I was trying to be open-minded and look from all perspectives at this ticket/patch... Did anyone check the first [commit|https://github.com/ekaterinadimitrova2/cassandra/blob/CASSANDRA-15234-1-outdated/conf/cassandra.yaml] where I suggested reorganizing the text in the yaml into sections? I also put it into the ticket thread. This was a quick draft shared two months ago that could be reworked to sections that satisfy the users' requirements for clarity and consistency. Do we see any big difference for the users between: {code:java} #*Replica Filtering Protection* cached_rows_warn_threshold: 1000 cached_rows_fail_threshold: 16000 {code} and: {code:java} replica_filtering_protection: - cached_rows_warn_threshold: 1000 - cached_rows_fail_threshold: 16000 {code} From that perspective, I think the C* community can accept this patch and then we can raise a new ticket to improve the internals from our engineering perspective in Beta (refactoring the Config class and the backward compatibility framework), as suggested by [~mck]. I think this work could really be considered incremental work. Having that in mind, honestly, I don't find a justification to spend my time to rework and fully re-test the patch at this point in time. I am fine to be proved wrong in a justified way. [~benedict], [~blerer], [~mck], do you agree with me on my suggestion (reorganizing the yaml file and doing the nested parameters approach later)? {quote} I think this is indeed preferable to releasing an API we already expect to deprecate, however I think we're overstating the difficulty here. We haven't debated the parameter naming much at all, and we can easily land this in 4.0-beta. If [~e.dimitrova] doesn't have the time, and 4.0-beta is an acceptable window to land the work, I can take a look in a few weeks. {quote} I want to be clear - it is not about difficulty; this patch is time-consuming. It needs attention to detail and a look at the whole config, which touches the code in many places (also ccm, dtests, in-jvm tests, etc.) was (Author: e.dimitrova): Apologies for my late response. I was a bit sick these days and tried to disengage from work and take some rest over the weekend. With all my respect to everyone's opinion and experience on the project, I have two points here: - I truly support [~mck]'s questions. I believe they should be responded to before any decision is taken and someone jumps into actual work.
{quote}how many settings does it apply to? is taxonomy based on a technical or user perspective? if user/operator based, how many people need to be involved to get it right? if user/operator based, what if one property applies to multiple concerns? how does the @Replace annotation work between levels in the grouping? does this introduce more complexity/variations in what has to be tested? (since yaml can consist of old and new setting names) {quote} - I was also wondering today while I was trying to be open-minded and look from all perspectives at this ticket/patch... Did anyone check the first [commit|https://github.com/ekaterinadimitrova2/cassandra/blob/CASSANDRA-15234-1-outdated/conf/cassandra.yaml] where I suggested reorganizing the text in the yaml into sections? I also put it into the ticket thread. This was a quick draft shared two months ago that could be reworked to sections that satisfy the users' requirements for clarity and consistency. Do we see any big difference for the users between: {code:java} #*Replica Filtering Protection* cached_rows_warn_threshold: 1000 cached_rows_fail_threshold: 16000 {code} and: {code:java} replica_filtering_protection: - cached_rows_warn_threshold: 1000 - cached_rows_fail_threshold: 16000 {code} From that p
[jira] [Comment Edited] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152159#comment-17152159 ] Ekaterina Dimitrova edited comment on CASSANDRA-15234 at 7/6/20, 5:07 PM: -- Apologies for my late response. I was a bit sick these days and tried to disengage from work and take some rest over the weekend. With all my respect to everyone's opinion and experience on the project, I have two points here: - I truly support [~mck]'s questions. I believe they should be responded to before any decision is taken and someone jumps into actual work. {quote}how many settings does it apply to? is taxonomy based on a technical or user perspective? if user/operator based, how many people need to be involved to get it right? if user/operator based, what if one property applies to multiple concerns? how does the @Replace annotation work between levels in the grouping? does this introduce more complexity/variations in what has to be tested? (since yaml can consist of old and new setting names) {quote} - I was also wondering today while I was trying to be open-minded and look from all perspectives at this ticket/patch... Did anyone check the first [commit|https://github.com/ekaterinadimitrova2/cassandra/blob/CASSANDRA-15234-1-outdated/conf/cassandra.yaml] where I suggested reorganizing the text in the yaml into sections? I also put it into the ticket thread. This was a quick draft shared two months ago that could be reworked to sections that satisfy the users' requirements for clarity and consistency. Do we see any big difference for the users between: {code:java} #*Replica Filtering Protection* cached_rows_warn_threshold: 1000 cached_rows_fail_threshold: 16000 {code} and: {code:java} replica_filtering_protection: - cached_rows_warn_threshold: 1000 - cached_rows_fail_threshold: 16000 {code} From that perspective, I think the C* community can accept this patch and then we can raise a new ticket to improve the internals from our engineering perspective in Beta (refactoring the Config class and the backward compatibility framework), as suggested by [~mck]. I think this work could really be considered incremental work. Having that in mind, honestly, I don't find a justification to spend my time to rework and fully re-test the patch at this point in time. I am fine to be proved wrong in a justified way. [~benedict], [~blerer], [~mck], do you agree with me on my suggestion (reorganizing the yaml file and doing the nested parameters approach later)? {quote} I think this is indeed preferable to releasing an API we already expect to deprecate, however I think we're overstating the difficulty here. We haven't debated the parameter naming much at all, and we can easily land this in 4.0-beta. If [~e.dimitrova] doesn't have the time, and 4.0-beta is an acceptable window to land the work, I can take a look in a few weeks. {quote} I want to be clear - it is not about difficulty; this patch is time-consuming. It needs attention to detail and a look at the whole config, which touches the code in many places (also ccm, dtests, in-jvm tests, etc.) was (Author: e.dimitrova): Apologies for my late response. I was a bit sick these days and tried to disengage from work and take some rest over the weekend. With all my respect to everyone's opinion and experience on the project, I have two points here: - I truly support [~mck]'s questions. I believe they should be responded to before any decision is taken and someone jumps into actual work.
{quote}how many settings does it apply to? is taxonomy based on a technical or user perspective? if user/operator based, how many people need to be involved to get it right? if user/operator based, what if one property applies to multiple concerns? how does the @Replace annotation work between levels in the grouping? does this introduce more complexity/variations in what has to be tested? (since yaml can consist of old and new setting names) {quote} - I was also wondering today while I was trying to be open-minded and look from all perspectives at this ticket/patch... Did anyone check the first [commit |https://github.com/ekaterinadimitrova2/cassandra/blob/CASSANDRA-15234-1-outdated/conf/cassandra.yaml] where I was suggesting reorganizing of the text into the yaml into sections? I also put it into the ticket thread . This was a quick draft shared two months ago that could be reworked to sections that satisfy the users' requirements for clarity and consistency. Do we see any big difference for the users between: {code:java} #*Replica Filtering Protection* cached_rows_warn_threshold: 1000 cached_rows_fail_threshold: 16000 {code} and: {code:java} replica_filtering_protection: - cached_rows_warn_threshold: 1000 - cached_rows_fail_threshold: 16000 {code} >From that p
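A note for readers following the yaml discussion: written as a proper yaml mapping, the nested form would carry no leading dashes (dashes denote list items), and the grouped section would then bind onto a nested structure in the Config class. A minimal sketch of what such a binding could look like, assuming SnakeYAML-style field mapping; the class and field names below are illustrative, not the committed design:
{code:java}
// Illustrative only: a grouped cassandra.yaml section such as
//   replica_filtering_protection:
//     cached_rows_warn_threshold: 1000
//     cached_rows_fail_threshold: 16000
// could be bound onto nested public fields, in the style of
// Cassandra's yaml-to-Config loading.
public class Config
{
    public ReplicaFilteringProtectionOptions replica_filtering_protection =
        new ReplicaFilteringProtectionOptions();

    public static class ReplicaFilteringProtectionOptions
    {
        public int cached_rows_warn_threshold = 1000;
        public int cached_rows_fail_threshold = 16000;
    }
}
{code}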
[jira] [Updated] (CASSANDRA-15924) Avoid emitting empty range tombstones from RangeTombstoneList
[ https://issues.apache.org/jira/browse/CASSANDRA-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcus Eriksson updated CASSANDRA-15924: Reviewers: Alex Petrov, Sylvain Lebresne [~ifesdjeen] & [~slebresne] do you have time to review? > Avoid emitting empty range tombstones from RangeTombstoneList > - > > Key: CASSANDRA-15924 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15924 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Coordination >Reporter: Marcus Eriksson >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.x > > > In {{RangeTombstoneList#iterator}} there is a chance we emit empty range > tombstones depending on the slice passed in. This can happen during read > repair with either an empty slice or with paging and the final page being > empty. > This creates problems in RTL if we try to insert a new range tombstone which > covers the empty ones; > {code} > Caused by: java.lang.AssertionError > at > org.apache.cassandra.db.RangeTombstoneList.insertFrom(RangeTombstoneList.java:541) > at > org.apache.cassandra.db.RangeTombstoneList.addAll(RangeTombstoneList.java:217) > at > org.apache.cassandra.db.MutableDeletionInfo.add(MutableDeletionInfo.java:141) > at > org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:137) > at org.apache.cassandra.db.Memtable.put(Memtable.java:254) > at > org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1210) > at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:573) > at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:421) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:210) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:215) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:224) > at > org.apache.cassandra.cql3.statements.ModificationStatement.executeInternalWithoutCondition(ModificationStatement.java:582) > at > org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:572) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15924) Avoid emitting empty range tombstones from RangeTombstoneList
[ https://issues.apache.org/jira/browse/CASSANDRA-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcus Eriksson updated CASSANDRA-15924: Test and Documentation Plan: new tests Status: Patch Available (was: Open) https://github.com/krummas/cassandra/commits/marcuse/15924 (also includes a fix to RowAndDeletionMergeIterator to make sure there are no other paths creating these empty tombstones) unit tests: https://circleci.com/gh/krummas/cassandra/3440 jvm dtests: https://circleci.com/gh/krummas/cassandra/3441 > Avoid emitting empty range tombstones from RangeTombstoneList > - > > Key: CASSANDRA-15924 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15924 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Coordination >Reporter: Marcus Eriksson >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.x > > > In {{RangeTombstoneList#iterator}} there is a chance we emit empty range > tombstones depending on the slice passed in. This can happen during read > repair with either an empty slice or with paging and the final page being > empty. > This creates problems in RTL if we try to insert a new range tombstone which > covers the empty ones; > {code} > Caused by: java.lang.AssertionError > at > org.apache.cassandra.db.RangeTombstoneList.insertFrom(RangeTombstoneList.java:541) > at > org.apache.cassandra.db.RangeTombstoneList.addAll(RangeTombstoneList.java:217) > at > org.apache.cassandra.db.MutableDeletionInfo.add(MutableDeletionInfo.java:141) > at > org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:137) > at org.apache.cassandra.db.Memtable.put(Memtable.java:254) > at > org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1210) > at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:573) > at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:421) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:210) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:215) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:224) > at > org.apache.cassandra.cql3.statements.ModificationStatement.executeInternalWithoutCondition(ModificationStatement.java:582) > at > org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:572) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
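To make the failure mode concrete for readers, here is a self-contained miniature of the invariant the patch enforces, with ints standing in for clustering bounds; the real code operates on slices and {{RangeTombstoneList}}, and all names below are illustrative:
{code:java}
// A tombstone intersected with a query slice must not be emitted when the
// intersection is empty: emitting it allows a later, covering insert to trip
// the RangeTombstoneList#insertFrom assertion shown in the description.
final class Range
{
    final int start, end; // half-open interval [start, end)

    Range(int start, int end) { this.start = start; this.end = end; }

    boolean isEmpty() { return start >= end; }

    /** Intersect with a slice; returns null when nothing should be emitted. */
    Range intersect(Range slice)
    {
        Range r = new Range(Math.max(start, slice.start), Math.min(end, slice.end));
        return r.isEmpty() ? null : r;
    }
}
{code}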
[jira] [Commented] (CASSANDRA-9739) Migrate counter-cache to be fully off-heap
[ https://issues.apache.org/jira/browse/CASSANDRA-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152139#comment-17152139 ] Aleksey Yeschenko commented on CASSANDRA-9739: -- Sure > Migrate counter-cache to be fully off-heap > -- > > Key: CASSANDRA-9739 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9739 > Project: Cassandra > Issue Type: Sub-task > Components: Legacy/Core >Reporter: Robert Stupp >Assignee: Robert Stupp >Priority: Normal > Fix For: 4.x > > > Counter cache still uses a concurrent map on-heap. This could go to off-heap > and feels doable now after CASSANDRA-8099. > Evaluation should be done in advance based on a POC to prove that pure > off-heap counter cache buys a performance and/or gc-pressure improvement. > In theory, elimination of on-heap management of the map should buy us some > benefit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15579) 4.0 quality testing: Distributed Read/Write Path: Coordination, Replication, and Read Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152137#comment-17152137 ] Jordan West commented on CASSANDRA-15579: - No objection to splitting. I think this was intended as a parent to sub-tasks with more specific scope. > 4.0 quality testing: Distributed Read/Write Path: Coordination, Replication, > and Read Repair > > > Key: CASSANDRA-15579 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15579 > Project: Cassandra > Issue Type: Task > Components: Test/unit >Reporter: Josh McKenzie >Assignee: Andres de la Peña >Priority: Normal > Fix For: 4.0-beta > > > Reference [doc from > NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] > for context. > *Shepherd: Blake Eggleston* > Testing in this area focuses on non-node-local aspects of the read-write > path: coordination, replication, read repair, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15579) 4.0 quality testing: Distributed Read/Write Path: Coordination, Replication, and Read Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152134#comment-17152134 ] Andres de la Peña edited comment on CASSANDRA-15579 at 7/6/20, 4:21 PM: bq. Also, just a structural nit, we could easily break this Jira into two...one dealing with coordination/replication and the other dealing with read repair. +1 to breaking it into two, that would help us to reduce the potentially vast scope of the ticket. was (Author: adelapena): > Also, just a structural nit, we could easily break this Jira into two...one > dealing with coordination/replication and the other dealing with read repair. +1 to breaking it into two, that would help us to reduce the potentially vast scope of the ticket. > 4.0 quality testing: Distributed Read/Write Path: Coordination, Replication, > and Read Repair > > > Key: CASSANDRA-15579 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15579 > Project: Cassandra > Issue Type: Task > Components: Test/unit >Reporter: Josh McKenzie >Assignee: Andres de la Peña >Priority: Normal > Fix For: 4.0-beta > > > Reference [doc from > NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] > for context. > *Shepherd: Blake Eggleston* > Testing in this area focuses on non-node-local aspects of the read-write > path: coordination, replication, read repair, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15579) 4.0 quality testing: Distributed Read/Write Path: Coordination, Replication, and Read Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152125#comment-17152125 ] Caleb Rackliffe commented on CASSANDRA-15579: - Also, just a structural nit, we could easily break this Jira into two...one dealing with coordination/replication and the other dealing with read repair. > 4.0 quality testing: Distributed Read/Write Path: Coordination, Replication, > and Read Repair > > > Key: CASSANDRA-15579 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15579 > Project: Cassandra > Issue Type: Task > Components: Test/unit >Reporter: Josh McKenzie >Assignee: Andres de la Peña >Priority: Normal > Fix For: 4.0-beta > > > Reference [doc from > NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] > for context. > *Shepherd: Blake Eggleston* > Testing in this area focuses on non-node-local aspects of the read-write > path: coordination, replication, read repair, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15924) Avoid emitting empty range tombstones from RangeTombstoneList
[ https://issues.apache.org/jira/browse/CASSANDRA-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcus Eriksson updated CASSANDRA-15924: Bug Category: Parent values: Availability(12983)Level 1 values: Response Crash(12991) Complexity: Normal Component/s: Consistency/Coordination Discovered By: Unit Test Fix Version/s: 4.x 3.11.x 3.0.x Severity: Normal Status: Open (was: Triage Needed) > Avoid emitting empty range tombstones from RangeTombstoneList > - > > Key: CASSANDRA-15924 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15924 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Coordination >Reporter: Marcus Eriksson >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.x > > > In {{RangeTombstoneList#iterator}} there is a chance we emit empty range > tombstones depending on the slice passed in. This can happen during read > repair with either an empty slice or with paging and the final page being > empty. > This creates problems in RTL if we try to insert a new range tombstone which > covers the empty ones; > {code} > Caused by: java.lang.AssertionError > at > org.apache.cassandra.db.RangeTombstoneList.insertFrom(RangeTombstoneList.java:541) > at > org.apache.cassandra.db.RangeTombstoneList.addAll(RangeTombstoneList.java:217) > at > org.apache.cassandra.db.MutableDeletionInfo.add(MutableDeletionInfo.java:141) > at > org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:137) > at org.apache.cassandra.db.Memtable.put(Memtable.java:254) > at > org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1210) > at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:573) > at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:421) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:210) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:215) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:224) > at > org.apache.cassandra.cql3.statements.ModificationStatement.executeInternalWithoutCondition(ModificationStatement.java:582) > at > org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:572) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-15924) Avoid emitting empty range tombstones from RangeTombstoneList
Marcus Eriksson created CASSANDRA-15924: --- Summary: Avoid emitting empty range tombstones from RangeTombstoneList Key: CASSANDRA-15924 URL: https://issues.apache.org/jira/browse/CASSANDRA-15924 Project: Cassandra Issue Type: Bug Reporter: Marcus Eriksson Assignee: Marcus Eriksson In {{RangeTombstoneList#iterator}} there is a chance we emit empty range tombstones depending on the slice passed in. This can happen during read repair with either an empty slice or with paging and the final page being empty. This creates problems in RTL if we try to insert a new range tombstone which covers the empty ones; {code} Caused by: java.lang.AssertionError at org.apache.cassandra.db.RangeTombstoneList.insertFrom(RangeTombstoneList.java:541) at org.apache.cassandra.db.RangeTombstoneList.addAll(RangeTombstoneList.java:217) at org.apache.cassandra.db.MutableDeletionInfo.add(MutableDeletionInfo.java:141) at org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:137) at org.apache.cassandra.db.Memtable.put(Memtable.java:254) at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1210) at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:573) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:421) at org.apache.cassandra.db.Mutation.apply(Mutation.java:210) at org.apache.cassandra.db.Mutation.apply(Mutation.java:215) at org.apache.cassandra.db.Mutation.apply(Mutation.java:224) at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternalWithoutCondition(ModificationStatement.java:582) at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:572) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152110#comment-17152110 ] Benedict Elliott Smith commented on CASSANDRA-15234: > There is no real urgency to fix that ticket, so if we want to go the grouping > way within the scope of that ticket we should move it to 4.X. I think this is indeed preferable to releasing an API we already expect to deprecate, however I think we're overstating the difficulty here. We haven't debated the parameter naming much at all, and we can easily land this in 4.0-beta. If [~e.dimitrova] doesn't have the time, and 4.0-beta is an acceptable window to land the work, I can take a look in a few weeks. > Standardise config and JVM parameters > - > > Key: CASSANDRA-15234 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15234 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Benedict Elliott Smith >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-alpha > > Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt > > > We have a bunch of inconsistent names and config patterns in the codebase, > both from the yamls and JVM properties. It would be nice to standardise the > naming (such as otc_ vs internode_) as well as the provision of values with > units - while maintaining perpetual backwards compatibility with the old > parameter names, of course. > For temporal units, I would propose parsing strings with suffixes of: > {code} > u|micros(econds?)? > ms|millis(econds?)? > s(econds?)? > m(inutes?)? > h(ours?)? > d(ays?)? > mo(nths?)? > {code} > For rate units, I would propose parsing any of the standard {{B/s, KiB/s, > MiB/s, GiB/s, TiB/s}}. > Perhaps for avoiding ambiguity we could choose not to accept bauds {{bs, Mbps}} or > powers of 1000 such as {{KB/s}}, given these are regularly used for either > their old or new definition e.g. {{KiB/s}}, or we could support them and > simply log the value in bytes/s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
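As an aside for readers, the temporal-suffix grammar proposed in the ticket description above is small enough to sketch. The following is a simplified, hedged illustration covering only a subset of the proposed suffixes; {{DurationSpec}} and {{toMillis}} are invented names, not the API under review:
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class DurationSpec
{
    // A subset of the proposed grammar: an integer quantity followed by a unit.
    private static final Pattern PATTERN = Pattern.compile("^(\\d+)\\s*(us|ms|s|m|h|d)$");

    public static long toMillis(String value)
    {
        Matcher m = PATTERN.matcher(value.trim().toLowerCase());
        if (!m.matches())
            throw new IllegalArgumentException("invalid duration: " + value);
        long quantity = Long.parseLong(m.group(1));
        switch (m.group(2))
        {
            case "us": return TimeUnit.MICROSECONDS.toMillis(quantity);
            case "ms": return quantity;
            case "s":  return TimeUnit.SECONDS.toMillis(quantity);
            case "m":  return TimeUnit.MINUTES.toMillis(quantity);
            case "h":  return TimeUnit.HOURS.toMillis(quantity);
            case "d":  return TimeUnit.DAYS.toMillis(quantity);
            default:   throw new AssertionError();
        }
    }
}
{code}
For example, {{DurationSpec.toMillis("10s")}} yields {{10000}}.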
[jira] [Commented] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152103#comment-17152103 ] Benedict Elliott Smith commented on CASSANDRA-15234: > We have un-tested beliefs about a potentially superior design We don't ordinarily label our judgements about design "un-tested beliefs" and I think it would help to avoid this kind of rhetoric. If we all start labelling design decisions in this way the project might grind to a halt. I have anyway tried specifically to sidestep this kind of accusation, by leaving the ball in your court. I am simply asking those pushing to move ahead with the current proposal to endorse the view that it is superior. This is a very weak criterion to meet, and involves no beliefs external to yourselves. > Standardise config and JVM parameters > - > > Key: CASSANDRA-15234 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15234 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Benedict Elliott Smith >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-alpha > > Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt > > > We have a bunch of inconsistent names and config patterns in the codebase, > both from the yamls and JVM properties. It would be nice to standardise the > naming (such as otc_ vs internode_) as well as the provision of values with > units - while maintaining perpetual backwards compatibility with the old > parameter names, of course. > For temporal units, I would propose parsing strings with suffixes of: > {code} > u|micros(econds?)? > ms|millis(econds?)? > s(econds?)? > m(inutes?)? > h(ours?)? > d(ays?)? > mo(nths?)? > {code} > For rate units, I would propose parsing any of the standard {{B/s, KiB/s, > MiB/s, GiB/s, TiB/s}}. > Perhaps for avoiding ambiguity we could choose not to accept bauds {{bs, Mbps}} or > powers of 1000 such as {{KB/s}}, given these are regularly used for either > their old or new definition e.g. {{KiB/s}}, or we could support them and > simply log the value in bytes/s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152102#comment-17152102 ] Benjamin Lerer commented on CASSANDRA-15234: Agreeing on grouping will take a significant amount of time, especially now when a lot of people are pretty busy with other tasks. There is no real urgency to fix that ticket, so if we want to go the grouping way within the scope of that ticket we should move it to 4.X. > Standardise config and JVM parameters > - > > Key: CASSANDRA-15234 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15234 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Benedict Elliott Smith >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-alpha > > Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt > > > We have a bunch of inconsistent names and config patterns in the codebase, > both from the yamls and JVM properties. It would be nice to standardise the > naming (such as otc_ vs internode_) as well as the provision of values with > units - while maintaining perpetual backwards compatibility with the old > parameter names, of course. > For temporal units, I would propose parsing strings with suffixes of: > {code} > u|micros(econds?)? > ms|millis(econds?)? > s(econds?)? > m(inutes?)? > h(ours?)? > d(ays?)? > mo(nths?)? > {code} > For rate units, I would propose parsing any of the standard {{B/s, KiB/s, > MiB/s, GiB/s, TiB/s}}. > Perhaps for avoiding ambiguity we could choose not to accept bauds {{bs, Mbps}} or > powers of 1000 such as {{KB/s}}, given these are regularly used for either > their old or new definition e.g. {{KiB/s}}, or we could support them and > simply log the value in bytes/s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15889) Debian package fails to download on Arm-based hosts
[ https://issues.apache.org/jira/browse/CASSANDRA-15889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152071#comment-17152071 ] Matt Davis commented on CASSANDRA-15889: Just checking if there's been any movement here, thanks! > Debian package fails to download on Arm-based hosts > --- > > Key: CASSANDRA-15889 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15889 > Project: Cassandra > Issue Type: Bug >Reporter: Matt Davis >Priority: Normal > > Following the first three steps of the [Debian install > process|https://cassandra.apache.org/download/], after an apt-get update > you'll see this line: > {code:bash} > $ sudo apt-get update > ... > N: Skipping acquire of configured file 'main/binary-arm64/Packages' as > repository 'https://downloads.apache.org/cassandra/debian 311x InRelease' > doesn't support architecture 'arm64' > {code} > Checking the [Debian > repo|https://dl.bintray.com/apache/cassandra/dists/311x/main/] confirms there > is no aarch64 variant available. > Should you then attempt to install Cassandra: > {code:bash} > $ sudo apt-get install cassandra > Reading package lists... Done > Building dependency tree > Reading state information... Done > Package cassandra is not available, but is referred to by another package. > This may mean that the package is missing, has been obsoleted, or > is only available from another source > E: Package 'cassandra' has no installation candidate > {code} > The Redhat RPM contains a "noarch" arch type, so it will download on any > host. (Cassandra does not use separate binaries/releases for different > architectures, so this seems to be the correct approach, but adding an > aarch64 variant would also suffice.) > Note that there is a workaround available: if you specify "amd64" as the arch > for the source, it downloads and runs on Arm without issue: > {code:bash} > deb [arch=amd64] https://downloads.apache.org/cassandra/debian 311x main > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-9555) Don't let offline tools run while cassandra is running
[ https://issues.apache.org/jira/browse/CASSANDRA-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Stupp reassigned CASSANDRA-9555: --- Assignee: (was: Robert Stupp) > Don't let offline tools run while cassandra is running > -- > > Key: CASSANDRA-9555 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9555 > Project: Cassandra > Issue Type: Improvement > Components: Legacy/Tools >Reporter: Marcus Eriksson >Priority: Low > Fix For: 4.x > > > We should not let offline tools that modify sstables run while Cassandra is > running. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-13838) Ensure FastThreadLocal.removeAll() is called for all threads
[ https://issues.apache.org/jira/browse/CASSANDRA-13838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Stupp updated CASSANDRA-13838: - Status: Open (was: Patch Available) > Ensure FastThreadLocal.removeAll() is called for all threads > > > Key: CASSANDRA-13838 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13838 > Project: Cassandra > Issue Type: Improvement > Components: Legacy/Core >Reporter: Robert Stupp >Assignee: Robert Stupp >Priority: Normal > > There are a couple of places where it's not guaranteed that > FastThreadLocal.removeAll() is called. Most misses are actually not that > critical, but the miss for the threads created via > org.apache.cassandra.streaming.ConnectionHandler.MessageHandler#start(java.net.Socket, > int, boolean) could be critical, because these threads are created for every > stream-session. > (Follow-up from CASSANDRA-13754) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
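The guarantee the ticket asks for can be sketched with a small wrapper that cleans up on any exit path; {{withFastThreadLocalCleanup}} is a name invented for this sketch, while {{FastThreadLocal.removeAll()}} is Netty's real API:
{code:java}
import io.netty.util.concurrent.FastThreadLocal;

public final class ThreadLocalCleanup
{
    // Wrap a task so FastThreadLocal state is dropped when the thread's work
    // finishes, whichever path (normal or exceptional) it exits through.
    public static Runnable withFastThreadLocalCleanup(Runnable task)
    {
        return () -> {
            try
            {
                task.run();
            }
            finally
            {
                FastThreadLocal.removeAll();
            }
        };
    }
}
{code}
For the streaming MessageHandler case mentioned above, the thread's run loop would be wrapped this way before the thread is started.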
[jira] [Commented] (CASSANDRA-9739) Migrate counter-cache to be fully off-heap
[ https://issues.apache.org/jira/browse/CASSANDRA-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152061#comment-17152061 ] Robert Stupp commented on CASSANDRA-9739: - [~aleksey] do you mind if I close this one? > Migrate counter-cache to be fully off-heap > -- > > Key: CASSANDRA-9739 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9739 > Project: Cassandra > Issue Type: Sub-task > Components: Legacy/Core >Reporter: Robert Stupp >Assignee: Robert Stupp >Priority: Normal > Fix For: 4.x > > > Counter cache still uses a concurrent map on-heap. This could go to off-heap > and feels doable now after CASSANDRA-8099. > Evaluation should be done in advance based on a POC to prove that pure > off-heap counter cache buys a performance and/or gc-pressure improvement. > In theory, elimination of on-heap management of the map should buy us some > benefit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-9454) Log WARN on Multi Partition IN clause Queries
[ https://issues.apache.org/jira/browse/CASSANDRA-9454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Stupp updated CASSANDRA-9454: Reviewers: (was: Robert Stupp) > Log WARN on Multi Partition IN clause Queries > - > > Key: CASSANDRA-9454 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9454 > Project: Cassandra > Issue Type: New Feature > Components: Legacy/CQL >Reporter: Sebastian Estevez >Assignee: T Jake Luciani >Priority: Low > Fix For: 2.2.x > > > Similar to CASSANDRA-6487 but for multi-partition queries. > Show a warning (ideally at the client, CASSANDRA-8930) when users try to use IN > clauses when clustering columns span multiple partitions. The right way to go > is async requests per partition. > **Update**: Unless the query is CL.ONE and all the partition ranges are on > the node! In which case multi-partition IN is okay. > This can cause an OOM > {code} > ERROR [Thread-388] 2015-05-18 12:11:10,147 CassandraDaemon.java (line 199) > Exception in thread Thread[Thread-388,5,main] > java.lang.OutOfMemoryError: Java heap space > ERROR [ReadStage:321] 2015-05-18 12:11:10,147 CassandraDaemon.java (line 199) > Exception in thread Thread[ReadStage:321,5,main] > java.lang.OutOfMemoryError: Java heap space > at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) > at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) > at > org.apache.cassandra.io.util.MappedFileDataInput.readBytes(MappedFileDataInput.java:146) > at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:392) > at > org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:371) > at > org.apache.cassandra.io.sstable.IndexHelper$IndexInfo.deserialize(IndexHelper.java:187) > at > org.apache.cassandra.db.RowIndexEntry$Serializer.deserialize(RowIndexEntry.java:122) > at > org.apache.cassandra.io.sstable.SSTableReader.getPosition(SSTableReader.java:970) > at > org.apache.cassandra.io.sstable.SSTableReader.getPosition(SSTableReader.java:871) > at > org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:41) > at > org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:167) > at > org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:62) > at > org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:250) > at > org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:53) > at > org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1547) > at > org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1376) > at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:327) > at > org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:65) > at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:47) > at > org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > {code} > By flooding heap with: > {code}org.apache.cassandra.io.sstable.IndexHelper$IndexInfo{code} > taken from: > http://stackoverflow.com/questions/30366729/out-of-memory-error-in-cassandra-when-querying-big-rows-containing-a-collection -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
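For readers, the "async requests per partition" pattern recommended in the description looks roughly like this with the DataStax Java driver 3.x; the keyspace, table, and column names are invented for the sketch:
{code:java}
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;

public final class PerPartitionReads
{
    // Instead of SELECT ... WHERE pk IN (k1, k2, ...), issue one async read
    // per partition key and collect the results; each read can then be served
    // by the replicas owning that partition rather than a single coordinator.
    public static List<ResultSet> fetch(Session session, List<Object> keys)
    {
        PreparedStatement ps = session.prepare("SELECT * FROM ks.tbl WHERE pk = ?");
        List<ResultSetFuture> futures = new ArrayList<>();
        for (Object key : keys)
            futures.add(session.executeAsync(ps.bind(key)));

        List<ResultSet> results = new ArrayList<>();
        for (ResultSetFuture f : futures)
            results.add(f.getUninterruptibly());
        return results;
    }
}
{code}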
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Stupp updated CASSANDRA-15922: - Reviewers: Benedict Elliott Smith, Robert Stupp, Robert Stupp (was: Benedict Elliott Smith, Robert Stupp) Status: Review In Progress (was: Patch Available) > High CAS failures in NativeAllocator.Region.allocate(..) > - > > Key: CASSANDRA-15922 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15922 > Project: Cassandra > Issue Type: Bug > Components: Local/Memtable >Reporter: Michael Semb Wever >Assignee: Michael Semb Wever >Priority: Normal > Fix For: 4.0, 3.0.x, 3.11.x > > Attachments: NativeAllocatorRegion2Test.java, > NativeAllocatorRegionTest.java, Screen Shot 2020-07-05 at 13.16.10.png, > Screen Shot 2020-07-05 at 13.26.17.png, Screen Shot 2020-07-05 at > 13.35.55.png, Screen Shot 2020-07-05 at 13.37.01.png, Screen Shot 2020-07-05 > at 13.48.16.png, Screen Shot 2020-07-06 at 11.35.35.png, Screen Shot > 2020-07-06 at 11.36.44.png, Screen Shot 2020-07-06 at 13.26.10.png, > profile_pbdpc23zafsrh_20200702.svg > > > h4. Problem > The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} > for the current offset in the region. Allocation depends on a > {{.compareAndSet(..)}} call. > In highly contended environments the CAS failures can be high, starving > writes in a running Cassandra node. > h4. Example > It has been witnessed that up to 33% of CPU time was stuck in the > {{NativeAllocator.Region.allocate(..)}} loop (due to the CAS failures) during > a heavy Spark analytics write load. > These nodes (40 CPU cores and 256GB RAM) have the relevant settings > - {{memtable_allocation_type: offheap_objects}} > - {{memtable_offheap_space_in_mb: 5120}} > - {{concurrent_writes: 160}} > Numerous flamegraphs demonstrate the problem. See attached > [^profile_pbdpc23zafsrh_20200702.svg]. > h4. Suggestion: ThreadLocal Regions > One possible solution is to have separate Regions per thread. > Code wise this is relatively easy to do, for example replacing > NativeAllocator:59 > {code}private final AtomicReference<Region> currentRegion = new > AtomicReference<>();{code} > with > {code}private final ThreadLocal<AtomicReference<Region>> currentRegion = new > ThreadLocal<>() {...};{code} > But this approach substantially changes the allocation behaviour, with more > than concurrent_writes number of Regions in use at any one time. For example > with {{concurrent_writes: 160}} that's 160+ regions, each of 1MB. > h4. Suggestion: Simple Contention Management Algorithm (Constant Backoff) > Another possible solution is to introduce a contention management algorithm > that a) reduces CAS failures in high contention environments, b) doesn't > impact normal environments, and c) keeps the allocation strategy of using one > region at a time. > The research paper [arXiv:1305.5800|https://arxiv.org/abs/1305.5800] > describes this contention CAS problem and demonstrates a number of algorithms > to apply. The simplest of these algorithms is the Constant Backoff CAS > Algorithm. > Applying the Constant Backoff CAS Algorithm involves adding one line of code > to {{NativeAllocator.Region.allocate(..)}} to sleep for one (or some constant > number of) nanoseconds after a CAS failure occurs. > That is... > {code} > // we raced and lost alloc, try again > LockSupport.parkNanos(1); > {code} > h4.
Constant Backoff CAS Algorithm Experiments > Using the code attached in NativeAllocatorRegionTest.java the concurrency and > CAS failures of {{NativeAllocator.Region.allocate(..)}} can be demonstrated. > In the attached [^NativeAllocatorRegionTest.java] class, which can be run > standalone, the {{Region}} class, copied from {{NativeAllocator.Region}}, also > has the {{casFailures}} field added. The following two screenshots are from > data collected from this class on a 6 CPU (12 core) MBP, running the > {{NativeAllocatorRegionTest.testRegionCAS}} method. > This attached screenshot shows the number of CAS failures during the life of > a Region (over ~215 million allocations), using different threads and park > times. This illustrates the improvement (reduction) of CAS failures from zero > park time, through orders of magnitude, up to 10000000ns (10ms). The biggest > improvement is from no algorithm to a park time of 1ns where CAS failures are > ~two orders of magnitude lower. From a park time of 10μs and higher there is a > significant drop also at low contention rates. > !Screen Shot 2020-07-05 at 13.16.10.png|width=500px! > This attached screenshot shows the time it takes to fill a Region (~215 > million allocations), using different threads and park times. The biggest > improvement is from no algorithm to a park time of 1ns where performance is > one order of magnitude faster.
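The constant-backoff variant is small enough to show standalone. The following mirrors the shape of {{NativeAllocator.Region.allocate(..)}} with the one-line backoff applied; the class skeleton and sizes are illustrative, not the attached test code:
{code:java}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

final class Region
{
    private final AtomicInteger nextFreeOffset = new AtomicInteger(0);
    private final int capacity;

    Region(int capacity) { this.capacity = capacity; }

    /** @return the allocated offset, or -1 when the region is exhausted. */
    int allocate(int size)
    {
        while (true)
        {
            int oldOffset = nextFreeOffset.get();
            if (oldOffset + size > capacity)
                return -1; // caller swaps in a fresh region

            if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
                return oldOffset;

            // we raced and lost alloc, try again -- with constant backoff
            LockSupport.parkNanos(1);
        }
    }
}
{code}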
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Stupp updated CASSANDRA-15922: - Reviewers: Benedict Elliott Smith, Robert Stupp
[jira] [Assigned] (CASSANDRA-15923) Collection types written via prepared statement not checked for nulls
[ https://issues.apache.org/jira/browse/CASSANDRA-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andres de la Peña reassigned CASSANDRA-15923: - Assignee: Andres de la Peña > Collection types written via prepared statement not checked for nulls > - > > Key: CASSANDRA-15923 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15923 > Project: Cassandra > Issue Type: Bug > Components: Messaging/Client >Reporter: Tom van der Woerdt >Assignee: Andres de la Peña >Priority: Normal > > To reproduce: > {code:java} > >>> cluster = Cluster() > >>> session = cluster.connect() > >>> session.execute("create keyspace frozen_int_test with replication = > >>> {'class': 'SimpleStrategy', 'replication_factor': 1}") > >>> session.execute("create table frozen_int_test.mytable (id int primary > >>> key, value frozen<list<int>>)") > >>> session.execute(session.prepare("insert into frozen_int_test.mytable (id, > >>> value) values (?, ?)"), (1, [1,2,3])) > >>> list(session.execute("select * from frozen_int_test.mytable")) > [Row(id=1, value=[1, 2, 3])] > >>> session.execute(session.prepare("insert into frozen_int_test.mytable (id, > >>> value) values (?, ?)"), (1, [1,2,None])) > >>> list(session.execute("select * from frozen_int_test.mytable")) > [Row(id=1, value=[1, 2, None])] {code} > Now you might say "But Tom, that just shows that it works!", but this does > not work as a CQL literal: > {code:java} > >>> session.execute("insert into frozen_int_test.mytable (id, value) values > >>> (1, [1,2,null])") > [...] cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] > message="null is not supported inside collections" {code} > Worse, if a mutation like this makes its way into the hints, it will be > retried indefinitely as it fails validation with a NullPointerException: > {code:java} > ERROR [MutationStage-11] 2020-07-06 09:23:25,696 > AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread > Thread[MutationStage-11,5,main] > java.lang.NullPointerException: null > at > org.apache.cassandra.serializers.Int32Serializer.validate(Int32Serializer.java:41) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.serializers.ListSerializer.validateForNativeProtocol(ListSerializer.java:70) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.serializers.CollectionSerializer.validate(CollectionSerializer.java:56) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.db.marshal.AbstractType.validate(AbstractType.java:162) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.db.marshal.AbstractType.validateCellValue(AbstractType.java:196) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.db.marshal.CollectionType.validateCellValue(CollectionType.java:124) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.config.ColumnDefinition.validateCell(ColumnDefinition.java:410) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.db.rows.AbstractCell.validate(AbstractCell.java:154) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.db.partitions.PartitionUpdate.validate(PartitionUpdate.java:486) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at java.util.Collections$SingletonSet.forEach(Collections.java:4769) > ~[na:1.8.0_252] > at > org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:69) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) > ~[apache-cassandra-3.11.6.jar:3.11.6]
> at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > ~[na:1.8.0_252] > at > org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:165) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:137) > [apache-cassandra-3.11.6.jar:3.11.6] > at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:113) > [apache-cassandra-3.11.6.jar:3.11.6] > at java.lang.Thread.run(Thread.java:748) [na:1.8.0_252] {code} > A similar problem is reproducible when writing into a non-frozen column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
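For contrast with the {{NullPointerException}} above, here is a minimal sketch of the kind of element-level null check that CQL literal parsing performs and that the prepared-statement path skips. {{CollectionNullCheck.validateNoNulls}} is a hypothetical helper invented for this example, not the actual Cassandra validation code.
{code}
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

final class CollectionNullCheck
{
    /** Rejects null elements in a bound collection value before it is written. */
    static void validateNoNulls(List<ByteBuffer> elements)
    {
        for (ByteBuffer element : elements)
            if (element == null)
                throw new IllegalArgumentException("null is not supported inside collections");
    }

    public static void main(String[] args)
    {
        try
        {
            // second element is null, mirroring the [1,2,None] bind above
            validateNoNulls(Arrays.asList(ByteBuffer.allocate(4), null));
        }
        catch (IllegalArgumentException e)
        {
            System.out.println(e.getMessage()); // null is not supported inside collections
        }
    }
}
{code}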
[jira] [Commented] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152050#comment-17152050 ] Josh McKenzie commented on CASSANDRA-15234: --- {quote}It is customary that, before work is committed, alternative proposals are engaged with on their technical merits. It occurs to me we recently worked to mandate this as part of the process, in fact. In this case we _seem_ in danger of subordinating this to beliefs about scheduling.{quote} Unfortunately it's customary on this project to do that right up until the last moment before something is committed (and even beyond), with no weighting of the value of things actually being in the hands of users vs. sitting in-tree and unreleased. We have un-tested beliefs about a potentially superior design set against un-tested beliefs about the negative impact of further delay on the project. This is not a situation in which we can expect to make progress on the discussion until and unless both sides collect some empirical evidence for their position, as well as spend real time investigating and exploring the positions of the other people engaged. Unfortunately I'm well past the time I personally have available to engage on this ticket; I'll defer to other people to take it from here. > Standardise config and JVM parameters > - > > Key: CASSANDRA-15234 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15234 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Benedict Elliott Smith >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-alpha > > Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt > > > We have a bunch of inconsistent names and config patterns in the codebase, > both from the yamls and JVM properties. It would be nice to standardise the > naming (such as otc_ vs internode_) as well as the provision of values with > units - while maintaining perpetual backwards compatibility with the old > parameter names, of course. > For temporal units, I would propose parsing strings with suffixes of: > {code} > u|micros(econds?)? > ms|millis(econds?)? > s(econds?)? > m(inutes?)? > h(ours?)? > d(ays?)? > mo(nths?)? > {code} > For rate units, I would propose parsing any of the standard {{B/s, KiB/s, > MiB/s, GiB/s, TiB/s}}. > Perhaps to avoid ambiguity we could not accept bauds {{bs, Mbps}} or powers > of 1000 such as {{KB/s}}, given these are regularly used for either their old > or new definition e.g. {{KiB/s}}, or we could support them and simply log the > value in bytes/s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
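As an illustration of the temporal suffixes proposed above, here is a small sketch of a parser that accepts them. The pattern follows the ticket's list, while the class and method names ({{DurationParser.parseDurationMillis}}) are invented for the example; months are approximated as 30 days and sub-millisecond values truncate toward zero.
{code}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DurationParser
{
    private static final Pattern PATTERN = Pattern.compile(
        "(\\d+)\\s*(u|micros(?:econds?)?|ms|millis(?:econds?)?|s(?:econds?)?" +
        "|m(?:inutes?)?|h(?:ours?)?|d(?:ays?)?|mo(?:nths?)?)");

    static long parseDurationMillis(String value)
    {
        Matcher m = PATTERN.matcher(value.trim());
        if (!m.matches())
            throw new IllegalArgumentException("cannot parse duration: " + value);

        long n = Long.parseLong(m.group(1));
        String unit = m.group(2);
        // order matters: check "mo", "ms"/"milli" and "micro" before plain "m" (minutes)
        if (unit.startsWith("mo"))                         return n * 30L * 24 * 60 * 60 * 1000;
        if (unit.equals("ms") || unit.startsWith("milli")) return n;
        if (unit.equals("u")  || unit.startsWith("micro")) return n / 1000; // truncates below 1ms
        if (unit.startsWith("m"))                          return n * 60_000L;
        if (unit.startsWith("s"))                          return n * 1000L;
        if (unit.startsWith("h"))                          return n * 3_600_000L;
        if (unit.startsWith("d"))                          return n * 86_400_000L;
        throw new IllegalArgumentException("unknown unit: " + unit);
    }

    public static void main(String[] args)
    {
        System.out.println(parseDurationMillis("10s"));      // 10000
        System.out.println(parseDurationMillis("5minutes")); // 300000
    }
}
{code}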
[jira] [Commented] (CASSANDRA-10968) When taking snapshot, manifest.json contains incorrect or no files when column family has secondary indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-10968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152040#comment-17152040 ] Andres de la Peña commented on CASSANDRA-10968: --- The fix looks good to me. I have run CI again: ||branch||utest||dtest|| |2.1 |[167|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-test/167/]|[201|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/201/]| |2.2 |[168|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-test/168/]|[202|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/202/]| |3.0 |[169|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-test/169/]|[203|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/203/]| |3.11|[170|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-test/170/]|[204|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/204/]| |4.0 |[171|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-test/171/]|[205|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/205/]| Dtests haven't finished yet. 2.1 CI seems to have failed to build, and indeed we don't seem to have a regular CI build for it. For 2.2 there are failures in {{SSTableRewriterTest}} that also happen in the base branch, so they don't seem related. We should certainly apply the fix from 2.2 onwards. I'm not sure this is critical enough for 2.1, but the fix is quite small, so it probably won't be a problem to include that branch, although the lack of CI for it is a bit worrying. > When taking snapshot, manifest.json contains incorrect or no files when > column family has secondary indexes > --- > > Key: CASSANDRA-10968 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10968 > Project: Cassandra > Issue Type: Bug > Components: Feature/2i Index >Reporter: Fred A >Assignee: Aleksandr Sorokoumov >Priority: Normal > Labels: lhf > Fix For: 2.2.x, 3.0.x, 3.11.x > > Time Spent: 1.5h > Remaining Estimate: 0h > > Noticed indeterminate behaviour when taking snapshots on column families that > have secondary indexes set up. The manifest.json created when taking a > snapshot sometimes contains no file names at all and sometimes only some file > names. > I don't know if this post is related, but it was the only thing I could find: > http://www.mail-archive.com/user%40cassandra.apache.org/msg42019.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15406) Add command to show the progress of data streaming and index build
[ https://issues.apache.org/jira/browse/CASSANDRA-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152024#comment-17152024 ] Berenguer Blasi edited comment on CASSANDRA-15406 at 7/6/20, 1:48 PM: -- Ok, so here's an opportunity for CASSANDRA-15502, or even as part of the 4.0 quality effort: some scaffolding for cmd line tooling testing. Thx otherwise, pending CI lgtm. was (Author: bereng): Ok so here's an opportunity for CASSANDRA-15502 or even as they quality 4.0 effort. Some scaffolding for cmd line tooling testing. Thx otherwise, pending CI lgtm. > Add command to show the progress of data streaming and index build > --- > > Key: CASSANDRA-15406 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15406 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Streaming, Legacy/Streaming and Messaging, > Tool/nodetool >Reporter: maxwellguo >Assignee: Stefan Miklosovic >Priority: Normal > Fix For: 4.0, 4.x > > Time Spent: 10m > Remaining Estimate: 0h > > I found that we should supply a command to show the progress of streaming > when we do a bootstrap/move/decommission/removenode operation. When doing > data streaming, nobody knows which step the program is in, so I think a > command to show the joining/leaving node's progress is needed. > > PR [https://github.com/apache/cassandra/pull/558] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
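As a sketch of the {{System.out}} hijacking idea discussed in this thread: the nodetool invocation itself is stubbed out with a plain {{println}} (there is no real command under test here), since only the capture-and-restore pattern is the point.
{code}
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class CapturedOutputExample
{
    public static void main(String[] args)
    {
        PrintStream original = System.out;
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        System.setOut(new PrintStream(captured));
        try
        {
            // stand-in for invoking the nodetool command under test
            System.out.println("streaming progress: 42%");
        }
        finally
        {
            System.setOut(original); // always restore stdout, even on failure
        }

        if (!captured.toString().contains("progress"))
            throw new AssertionError("expected progress output");
    }
}
{code}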
[jira] [Commented] (CASSANDRA-15406) Add command to show the progress of data streaming and index build
[ https://issues.apache.org/jira/browse/CASSANDRA-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152024#comment-17152024 ] Berenguer Blasi commented on CASSANDRA-15406: - Ok so here's an opportunity for CASSANDRA-15502 or even as they quality 4.0 effort. Some scaffolding for cmd line tooling testing. Thx otherwise, pending CI lgtm. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15406) Add command to show the progress of data streaming and index build
[ https://issues.apache.org/jira/browse/CASSANDRA-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152011#comment-17152011 ] Berenguer Blasi edited comment on CASSANDRA-15406 at 7/6/20, 1:42 PM: -- Would it make sense to add some testing either in a junit hijacking {{System.out}} and mocking things around in {{NodeToolTest.java}} i.e. Or even in {{nodetool_test.py}} dtest? I am aware the nodetool tool doesn't have this sort of testing atm (only some dtests) so we can use this as an opportunity to kickstart now. You can also tell me you'd rather do this in another ticket as it's not a quick fix :-) depending on how OCD on testing you feel on this one. was (Author: bereng): Would it make sense to add some testing either in a junit hijacking {{System.out}} and mocking things around in {{NodeToolTest.java}} i.e. Or even in {{nodetool_test.py}} dtest. I am aware the nodetool tool doesn't have this sort of testing atm (only some dtests) so we can use this as an opportunity to kickstart now. You can also tell me you'd rather do this in another ticket as it's not a quick fix :-) depending on how OCD on testing you feel on this one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151993#comment-17151993 ] Michael Semb Wever edited comment on CASSANDRA-15922 at 7/6/20, 1:23 PM: - Patch updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values in the {{toString(..)}} method (when the region is full and nextFreeOffset is past capacity) https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1 CI run at https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-devbranch/detail/Cassandra-devbranch/197/pipeline was (Author: michaelsembwever): Patched updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values in the {{toString(..)}} method (when region is full and nextFreeOffset is passed capacity) https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1 CI run at https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-devbranch/detail/Cassandra-devbranch/197/pipeline
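For readers unfamiliar with the {{getAndAdd(..)}} vs {{addAndGet(..)}} distinction referenced in the patch notes above, a small stand-alone illustration (not the patch itself): {{getAndAdd}} returns the previous value, which for an allocator is the allocation's own start offset, so no subtraction is needed.
{code}
import java.util.concurrent.atomic.AtomicInteger;

public class OffsetDemo
{
    public static void main(String[] args)
    {
        AtomicInteger nextFreeOffset = new AtomicInteger(0);
        int size = 64;

        // addAndGet returns the *new* offset; the allocation start must be derived
        int end = nextFreeOffset.addAndGet(size);
        int start = end - size;                      // 0

        // getAndAdd returns the *previous* offset, i.e. the allocation start directly
        int start2 = nextFreeOffset.getAndAdd(size); // 64 (the second allocation)

        System.out.println(start + " " + start2);    // prints: 0 64
    }
}
{code}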
[jira] [Commented] (CASSANDRA-15406) Add command to show the progress of data streaming and index build
[ https://issues.apache.org/jira/browse/CASSANDRA-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152014#comment-17152014 ] Stefan Miklosovic commented on CASSANDRA-15406: --- My OCD here is pretty weak, I would just move it to another PR. I was talking about testing of nodetool's output with [~ifesdjeen] and how we could go about that in a testing framework, but no luck so far ... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15406) Add command to show the progress of data streaming and index build
[ https://issues.apache.org/jira/browse/CASSANDRA-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152011#comment-17152011 ] Berenguer Blasi commented on CASSANDRA-15406: - Would it make sense to add some testing either in a junit hijacking {{System.out}} and mocking things around in {{NodeToolTest.java}} i.e. Or even in {{nodetool_test.py}} dtest. I am aware the nodetool tool doesn't have this sort of testing atm (only some dtests) so we can use this as an opportunity to kickstart now. You can also tell me you'd rather do this in another ticket as it's not a quick fix :-) depending on how OCD on testing you feel on this one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152010#comment-17152010 ] Benedict Elliott Smith commented on CASSANDRA-15922: +1
[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152006#comment-17152006 ] Robert Stupp commented on CASSANDRA-15922: -- +1 (assuming CI looks good and 3.11+3.0 back-ports are clean)
[jira] [Comment Edited] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151993#comment-17151993 ] Michael Semb Wever edited comment on CASSANDRA-15922 at 7/6/20, 12:51 PM: -- Patched updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values in the {{toString(..)}} method (when region is full and nextFreeOffset is passed capacity) https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1 CI run at https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-devbranch/detail/Cassandra-devbranch/197/pipeline was (Author: michaelsembwever): Patched updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values the {{toString(..)}} method (when region is full and nextFreeOffset is passed capacity) https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1 CI run at https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-devbranch/detail/Cassandra-devbranch/197/pipeline
[jira] [Comment Edited] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151993#comment-17151993 ] Michael Semb Wever edited comment on CASSANDRA-15922 at 7/6/20, 12:51 PM: -- Patched updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values the {{toString(..)}} method (when region is full and nextFreeOffset is passed capacity) https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1 CI run at https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-devbranch/detail/Cassandra-devbranch/197/pipeline was (Author: michaelsembwever): Patched updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values the {{toString(..)}} method (when region is full and nextFreeOffset is passed capacity) https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1
[jira] [Comment Edited] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151993#comment-17151993 ] Michael Semb Wever edited comment on CASSANDRA-15922 at 7/6/20, 12:42 PM: -- Patched updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values the {{toString(..)}} method (when region is full and nextFreeOffset is passed capacity) https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1 was (Author: michaelsembwever): Patched updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values the {{toString(..)}} method (when region is full and nextFreeOffset is passed capacity)
[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151993#comment-17151993 ] Michael Semb Wever commented on CASSANDRA-15922: Patched updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values the {{toString(..)}} method (when region is full and nextFreeOffset is passed capacity)
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Test and Documentation Plan: existing CI. benchmarking in ticket. (was: existing CI.) Status: Patch Available (was: In Progress)
[jira] [Comment Edited] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151970#comment-17151970 ] Michael Semb Wever edited comment on CASSANDRA-15922 at 7/6/20, 12:04 PM:

h4. {{addAndGet}} Experiments
Code patch at https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1
This patch depends on the {{addAndGet(..)}} call guaranteeing a (serial) result that returns the value from no overlapping/later add calls. AFAIK that is how AtomicInteger works.
I'm also curious whether we still need the {{allocCount}} AtomicInteger field; it appears to be there only for debugging. May I remove it in this patch?
Benchmark code attached in [^NativeAllocatorRegion2Test.java]. The following attached screenshot shows the time it takes to fill a Region (~215 million allocations), using different threads, comparing the original code (compareAndSet), the addAndGet, and the constant backoff (parkNano) approaches. The biggest improvement is still the constant backoff algorithm, where performance is one order of magnitude faster. But the addAndGet approach is 2x to 5x faster than the original, and as mentioned above it also comes with the benefit of no loop (no starvation) and faster performance in all workloads.
!Screen Shot 2020-07-06 at 13.26.10.png|width=600px!
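For reference, a minimal sketch (assumed shape, not the exact diff in the patch) of the loop-free allocation described above: the offset is bumped unconditionally, and success is decided in the aftermath.

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

class Region
{
    private final long peer;      // base address of the off-heap region
    private final int capacity;   // region size in bytes
    private final AtomicInteger nextFreeOffset = new AtomicInteger(0);

    Region(long peer, int capacity) { this.peer = peer; this.capacity = capacity; }

    /** Loop-free allocation: claim bytes unconditionally, check in the aftermath. */
    long allocate(int size)
    {
        // getAndAdd(..) returns the previous offset, i.e. the start of the
        // claimed range, so no retry loop (and no starvation) is possible.
        int oldOffset = nextFreeOffset.getAndAdd(size);
        if (oldOffset + size <= capacity)
            return peer + oldOffset; // success: bytes [oldOffset, oldOffset + size) are ours

        // Overshot: the region is full for an allocation of this size. The
        // caller swaps in a new Region, so the unused tail is simply wasted.
        return -1;
    }
}
{code}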
[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151970#comment-17151970 ] Michael Semb Wever commented on CASSANDRA-15922:

h4. {{addAndGet}} Experiments
Code patch at https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1
This patch depends on the {{addAndGet(..)}} call guaranteeing a (serial) result that returns the value from no overlapping/later add calls. AFAIK that is how AtomicInteger works.
I'm also curious whether we still need the {{allocCount}} AtomicInteger field; it appears to be there only for debugging. May I remove it in this patch?
Benchmark code attached in [^NativeAllocatorRegion2Test.java]. The following attached screenshot shows the time it takes to fill a Region (~215 million allocations), using different threads, comparing the original code (compareAndSet), the addAndGet, and the constant backoff (parkNano) approaches. The biggest improvement is still the algorithm with a park time of 1ns, where performance is one order of magnitude faster. The addAndGet approach is 2x to 5x faster than the original. As mentioned above it also comes with the benefit of no loop (no starvation) and faster performance in all workloads.
!Screen Shot 2020-07-06 at 13.26.10.png|width=600px!
[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151969#comment-17151969 ] Robert Stupp commented on CASSANDRA-15922:

+1 on {{addAndGet}} (or {{getAndAdd}}, whichever works best). And I agree, the allocation model that we currently have is not great, but as you said, it's a ton of work to get it right (less (ideally no) fragmentation, no unnecessary tiny allocations, no unnecessary copying, etc etc).
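For reference on the "whichever works best" remark: the two calls perform the same atomic increment and differ only in which value they return, so either is correct here. A minimal illustration:

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

class GetAndAddVsAddAndGet
{
    public static void main(String[] args)
    {
        AtomicInteger offset = new AtomicInteger(0);

        int previous = offset.getAndAdd(8); // returns 0; offset is now 8
        int updated  = offset.addAndGet(8); // returns 16; offset is now 16

        // getAndAdd reads more naturally for an allocator, since the
        // claimed range starts at the previous offset.
        System.out.println(previous + " " + updated);
    }
}
{code}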
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Attachment: NativeAllocatorRegion2Test.java
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Attachment: Screen Shot 2020-07-06 at 13.26.10.png
[jira] [Comment Edited] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151952#comment-17151952 ] Michael Semb Wever edited comment on CASSANDRA-15922 at 7/6/20, 10:53 AM:

bq. I assume in this case one of the problems is that we are allocating huge numbers of small objects, so that a small number of threads are competing over-and-over again to allocate the same data. We should not be competing for each Cell allocation, and instead try to allocate all the buffers for e.g. at least a Row at once.
This is correct. Rows with ~many hundreds of double cells.
bq. There is perhaps a better alternative: use addAndGet->if instead of read->if->compareAndSet, i.e. unconditionally update the pointer, then determine whether or not you successfully allocated in the aftermath. This is guaranteed to succeed in one step; contention can slow that step down modestly, but there is no wasted competition.
Sounds good. Will put it together and test.
[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151952#comment-17151952 ] Michael Semb Wever commented on CASSANDRA-15922:

bq. I assume in this case one of the problems is that we are allocating huge numbers of small objects, so that a small number of threads are competing over-and-over again to allocate the same data. We should not be competing for each Cell allocation, and instead try to allocate all the buffers for e.g. at least a Row at once.
This is correct. Rows with ~thousands of double cells.
bq. There is perhaps a better alternative: use addAndGet->if instead of read->if->compareAndSet, i.e. unconditionally update the pointer, then determine whether or not you successfully allocated in the aftermath. This is guaranteed to succeed in one step; contention can slow that step down modestly, but there is no wasted competition.
Sounds good. Will put it together and test.
[jira] [Commented] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151948#comment-17151948 ] Benedict Elliott Smith commented on CASSANDRA-15234:

bq. The valid concern around the api churn it introduces is addressed by committing the new ticket to 4.0-beta.
This ticket's API replaces the current API, is mutually exclusive with the alternative proposal, and would be deprecated by it. If we introduce them both in 4.0-beta, we must maintain them both and go through the full deprecation process. So unfortunately no churn is avoided.
> I'd be curious what your perspective is on how we determine what qualifies as justified
It is customary that, before work is committed, alternative proposals are engaged with on their technical merits. It occurs to me we recently worked to mandate this as part of the process, in fact. In this case we _seem_ in danger of subordinating this to beliefs about scheduling.
If you like, I can formulate a legally airtight veto, but my goal is only for you to engage briefly with the proposal and determine for yourselves which is superior. If the new proposal is _technically_* superior, and of similar complexity, then you are my justification. If you disagree, however - and importantly we agree that we do not intend to pursue the alternative approach in future - I would consider my veto invalid (and would anyway withdraw it).
> having heard of the proximity of the beta
Perhaps we can also directly address people's thoughts on deferral to 4.0-beta? This should surely alleviate concerns around delaying 4.0? I do understand the imperative to get 4.0 out the door, but I also know we all want to ship the best product we can as well. If we can achieve both, we should. APIs matter, and avoiding API churn is an important part of our user/operator story.
* I _hope_ we can avoid an epistemic battle about the word "technical," and accept that API design is a technical endeavour to convey meaning.

> Standardise config and JVM parameters
> --------------------------------------
>
> Key: CASSANDRA-15234
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15234
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Config
> Reporter: Benedict Elliott Smith
> Assignee: Ekaterina Dimitrova
> Priority: Normal
> Fix For: 4.0-alpha
>
> Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt
>
> We have a bunch of inconsistent names and config patterns in the codebase, both from the yamls and JVM properties. It would be nice to standardise the naming (such as otc_ vs internode_) as well as the provision of values with units - while maintaining perpetual backwards compatibility with the old parameter names, of course.
> For temporal units, I would propose parsing strings with suffixes of:
> {code}
> u|micros(econds?)?
> ms|millis(econds?)?
> s(econds?)?
> m(inutes?)?
> h(ours?)?
> d(ays?)?
> mo(nths?)?
> {code}
> For rate units, I would propose parsing any of the standard {{B/s, KiB/s, MiB/s, GiB/s, TiB/s}}.
> Perhaps for avoiding ambiguity we could not accept bit-based rates ({{b/s}}, {{Mbps}}) or powers of 1000 such as {{KB/s}}, given these are regularly used for either their old or new definition e.g. {{KiB/s}}, or we could support them and simply log the value in bytes/s.
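As a rough illustration of the proposed temporal-unit parsing, a hypothetical helper might look like the following sketch. The suffix alternatives mirror the list in the description above (months are omitted, since a month has no fixed millisecond length); the class name, the method name, and the choice to normalise to milliseconds are all assumptions made for the example.

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DurationParser
{
    // Accepts e.g. "10ms", "3 hours", "2d"; case-sensitive for brevity.
    private static final Pattern DURATION = Pattern.compile(
        "(\\d+)\\s*(u|micros(?:econds?)?|ms|millis(?:econds?)?" +
        "|s(?:econds?)?|m(?:inutes?)?|h(?:ours?)?|d(?:ays?)?)");

    static long parseMillis(String value)
    {
        Matcher m = DURATION.matcher(value.trim());
        if (!m.matches())
            throw new IllegalArgumentException("invalid duration: " + value);

        long amount = Long.parseLong(m.group(1));
        String unit = m.group(2);

        // Order matters: micro/milli must be checked before minutes.
        if (unit.equals("u") || unit.startsWith("micro"))
            return TimeUnit.MICROSECONDS.toMillis(amount); // truncates below 1ms
        if (unit.equals("ms") || unit.startsWith("milli"))
            return amount;
        if (unit.startsWith("s"))
            return TimeUnit.SECONDS.toMillis(amount);
        if (unit.startsWith("m"))
            return TimeUnit.MINUTES.toMillis(amount);
        if (unit.startsWith("h"))
            return TimeUnit.HOURS.toMillis(amount);
        if (unit.startsWith("d"))
            return TimeUnit.DAYS.toMillis(amount);
        throw new IllegalArgumentException("unknown unit: " + unit);
    }
}
{code}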
[jira] [Comment Edited] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151948#comment-17151948 ] Benedict Elliott Smith edited comment on CASSANDRA-15234 at 7/6/20, 10:38 AM:

bq. The valid concern around the api churn it introduces is addressed by committing the new ticket to 4.0-beta.
This ticket's API replaces the current API, is mutually exclusive with the alternative proposal, and would be deprecated by it. If we introduce them both in 4.0-beta, we must maintain them both and go through the full deprecation process. So unfortunately no churn is avoided.
> I'd be curious what your perspective is on how we determine what qualifies as justified
It is customary that, before work is committed, alternative proposals are engaged with on their technical merits. It occurs to me we recently worked to mandate this as part of the process, in fact. In this case we _seem_ in danger of subordinating this to beliefs about scheduling.
If you like, I can formulate a legally airtight veto, but my goal is only for you to engage briefly with the proposal and determine for yourselves which is superior. If the new proposal is _technically_* superior, and of similar complexity, then you are my justification. If you disagree, however - and importantly we agree that we do not intend to pursue the alternative approach in future - I would consider my veto invalid (and would anyway withdraw it).
> having heard of the proximity of the beta
Perhaps we can also directly address people's thoughts on deferral to 4.0-beta? This should surely alleviate concerns around delaying 4.0? I do understand the imperative to get 4.0 out the door, but I also know we all want to ship the best product we can as well. If we can achieve both, we should. APIs matter, and avoiding API churn is an important part of our user/operator story.
\* I _hope_ we can avoid an epistemic battle about the word "technical," and accept that API design is a technical endeavour to convey meaning.
[jira] [Created] (CASSANDRA-15923) Collection types written via prepared statement not checked for nulls
Tom van der Woerdt created CASSANDRA-15923: Summary: Collection types written via prepared statement not checked for nulls Key: CASSANDRA-15923 URL: https://issues.apache.org/jira/browse/CASSANDRA-15923 Project: Cassandra Issue Type: Bug Components: Messaging/Client Reporter: Tom van der Woerdt

To reproduce:

{code:java}
>>> from cassandra.cluster import Cluster
>>> cluster = Cluster()
>>> session = cluster.connect()
>>> session.execute("create keyspace frozen_int_test with replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")
>>> session.execute("create table frozen_int_test.mytable (id int primary key, value frozen<list<int>>)")
>>> session.execute(session.prepare("insert into frozen_int_test.mytable (id, value) values (?, ?)"), (1, [1,2,3]))
>>> list(session.execute("select * from frozen_int_test.mytable"))
[Row(id=1, value=[1, 2, 3])]
>>> session.execute(session.prepare("insert into frozen_int_test.mytable (id, value) values (?, ?)"), (1, [1,2,None]))
>>> list(session.execute("select * from frozen_int_test.mytable"))
[Row(id=1, value=[1, 2, None])]
{code}

Now you might say "But Tom, that just shows that it works!", but this does not work as a CQL literal:

{code:java}
>>> session.execute("insert into frozen_int_test.mytable (id, value) values (1, [1,2,null])")
[...]
cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="null is not supported inside collections"
{code}

Worse, if a mutation like this makes its way into the hints, it will be retried indefinitely as it fails validation with a NullPointerException:

{code:java}
ERROR [MutationStage-11] 2020-07-06 09:23:25,696 AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread Thread[MutationStage-11,5,main]
java.lang.NullPointerException: null
at org.apache.cassandra.serializers.Int32Serializer.validate(Int32Serializer.java:41) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.serializers.ListSerializer.validateForNativeProtocol(ListSerializer.java:70) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.serializers.CollectionSerializer.validate(CollectionSerializer.java:56) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.db.marshal.AbstractType.validate(AbstractType.java:162) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.db.marshal.AbstractType.validateCellValue(AbstractType.java:196) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.db.marshal.CollectionType.validateCellValue(CollectionType.java:124) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.config.ColumnDefinition.validateCell(ColumnDefinition.java:410) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.db.rows.AbstractCell.validate(AbstractCell.java:154) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.db.partitions.PartitionUpdate.validate(PartitionUpdate.java:486) ~[apache-cassandra-3.11.6.jar:3.11.6]
at java.util.Collections$SingletonSet.forEach(Collections.java:4769) ~[na:1.8.0_252]
at org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:69) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) ~[apache-cassandra-3.11.6.jar:3.11.6]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_252]
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:165) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:137) [apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:113) [apache-cassandra-3.11.6.jar:3.11.6]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_252]
{code}

A similar problem is reproducible when writing into a non-frozen column.
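For context, the missing server-side check is conceptually small. A hypothetical sketch (invented names, not the actual fix) of rejecting nulls in bound collection values at request time, mirroring the "null is not supported inside collections" error that CQL literals already receive:

{code:java}
import java.nio.ByteBuffer;
import java.util.List;

final class BoundCollectionValidation
{
    /**
     * Each element of a bound list arrives as a serialized ByteBuffer; a null
     * buffer is a null element. Rejecting it here would fail the write up
     * front, instead of producing a mutation that later fails hint replay
     * with a NullPointerException in Int32Serializer.validate(..).
     */
    static void rejectNullElements(List<ByteBuffer> serializedElements)
    {
        for (ByteBuffer element : serializedElements)
        {
            if (element == null)
                throw new IllegalArgumentException("null is not supported inside collections");
        }
    }
}
{code}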
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Status: In Progress (was: Patch Available)
[jira] [Updated] (CASSANDRA-10968) When taking snapshot, manifest.json contains incorrect or no files when column family has secondary indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-10968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andres de la Peña updated CASSANDRA-10968: --- Reviewers: Andres de la Peña

> When taking snapshot, manifest.json contains incorrect or no files when column family has secondary indexes
> -------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-10968
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10968
> Project: Cassandra
> Issue Type: Bug
> Components: Feature/2i Index
> Reporter: Fred A
> Assignee: Aleksandr Sorokoumov
> Priority: Normal
> Labels: lhf
> Fix For: 2.2.x, 3.0.x, 3.11.x
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> Noticed indeterminate behaviour when taking snapshots on column families that have secondary indexes set up. The manifest.json created when taking the snapshot sometimes contains no file names at all and sometimes only some file names.
> I don't know if this post is related, but it was the only thing I could find:
> http://www.mail-archive.com/user%40cassandra.apache.org/msg42019.html
[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151940#comment-17151940 ] Benedict Elliott Smith commented on CASSANDRA-15922: -
There is perhaps a better alternative: use {{addAndGet->if}} instead of {{read->if->compareAndSet}}, i.e. unconditionally update the pointer, then determine in the aftermath whether or not you successfully allocated. This is guaranteed to succeed in one step; contention can slow that step down modestly, but there is no wasted competition. There is no downside to this approach with the {{NativeAllocator}}, either, since if we fail to allocate we always swap the {{Region}}, so consuming more than we need when smaller allocations may have been possible is not a problem. So we should have made this change a long time ago, really.
It _might_ be that this approach still sees some slowdown: I assume in this case one of the problems is that we are allocating huge numbers of small objects, so that a small number of threads are competing over and over again to allocate the same data. We should not be competing for each {{Cell}} allocation, and should instead try to allocate all the buffers for e.g. at least a {{Row}} at once. But this is more involved. Ideally we would improve the allocator itself, which is very under-engineered, but with our threading model that's more challenging than we might like.
The _upside_ to this approach is that ordinary workloads should be _improved_, and there is no possibility of thread starvation. The current proposal by contrast introduces much longer windows for thread starvation, and _might_ negatively impact tail latencies. This is a very difficult thing for us to rule out, so the work required to demonstrate it is performance-neutral could be prohibitive.
> High CAS failures in NativeAllocator.Region.allocate(..)
> ---------------------------------------------------------
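A minimal sketch of the {{addAndGet->if}} alternative suggested in the comment above; illustrative only, with the {{Region}} shape assumed from the ticket, not a committed patch:
{code}
import java.util.concurrent.atomic.AtomicInteger;

final class Region
{
    private final int capacity; // assumed field
    private final AtomicInteger nextFreeOffset = new AtomicInteger(0);

    Region(int capacity) { this.capacity = capacity; }

    /**
     * Unconditionally bump the pointer, then check in the aftermath
     * whether the allocation actually fit. Completes in one atomic step:
     * no retry loop, so no wasted CAS competition.
     */
    int allocate(int size)
    {
        int end = nextFreeOffset.addAndGet(size);
        if (end > capacity)
            return -1; // did not fit; the caller swaps in a new Region,
                       // so the over-consumed tail is simply abandoned
        return end - size; // offset where this allocation begins
    }
}
{code}
The design trade-off named in the comment is visible here: a losing thread still consumes space it will never use, which is only acceptable because a full {{Region}} is always discarded and replaced.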
[jira] [Comment Edited] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151869#comment-17151869 ] Michael Semb Wever edited comment on CASSANDRA-15922 at 7/6/20, 9:43 AM: -
bq. Although the same change probably needs to be applied to org.apache.cassandra.utils.memory.SlabAllocator.Region#allocate as well.
Added to patch.
bq. there's a slight issue in the attached NativeAllocatorRegionTest.java Region.allocate() method that adds another CAS (casFailures) to every failed CAS against nextFreeOffset. It's probably better to count the number of failed CASes in a local variable and add it to this.casFailures when the test's Region.allocate() returns.
Fixed and re-running tests. Thanks [~snazy].
EDIT: new screenshots uploaded. Results and conclusions stay the same.
> High CAS failures in NativeAllocator.Region.allocate(..)
> ---------------------------------------------------------
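The test-side fix described in the comment above, sketched under the same assumed {{Region}} shape: failures are tallied in a local variable so the instrumentation itself does not add a second contended atomic to every retry.
{code}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

final class Region
{
    private final int capacity; // assumed field
    private final AtomicInteger nextFreeOffset = new AtomicInteger(0);
    // instrumentation: total CAS failures over the Region's life
    private final AtomicLong casFailures = new AtomicLong();

    Region(int capacity) { this.capacity = capacity; }

    int allocate(int size)
    {
        int failures = 0; // local counter: no extra contention per retry
        try
        {
            while (true)
            {
                int oldOffset = nextFreeOffset.get();
                if (oldOffset + size > capacity)
                    return -1;
                if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
                    return oldOffset;
                ++failures;
                LockSupport.parkNanos(1);
            }
        }
        finally
        {
            // added once, when allocate() returns
            casFailures.addAndGet(failures);
        }
    }
}
{code}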
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Attachment: Screen Shot 2020-07-06 at 11.35.35.png
Screen Shot 2020-07-06 at 11.36.44.png
> High CAS failures in NativeAllocator.Region.allocate(..)
> ---------------------------------------------------------
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Attachment: NativeAllocatorRegionTest.java
> High CAS failures in NativeAllocator.Region.allocate(..)
> ---------------------------------------------------------
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Attachment: (was: NativeAllocatorRegionTest.java)
> High CAS failures in NativeAllocator.Region.allocate(..)
> ---------------------------------------------------------
[jira] [Commented] (CASSANDRA-15901) Fix unit tests to load test/conf/cassandra.yaml (so to listen on a valid ip)
[ https://issues.apache.org/jira/browse/CASSANDRA-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151899#comment-17151899 ] Michael Semb Wever commented on CASSANDRA-15901: -
Just the unit tests (on cassandra13) at https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch-test/165
Full devbranch pipeline (now that we're touching runtime code) at https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/196/
> Fix unit tests to load test/conf/cassandra.yaml (so to listen on a valid ip)
> -----------------------------------------------------------------------------
> Key: CASSANDRA-15901
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15901
> Project: Cassandra
> Issue Type: Bug
> Components: Test/dtest
> Reporter: Berenguer Blasi
> Assignee: Berenguer Blasi
> Priority: Normal
> Fix For: 4.0-rc
>
> Many of the ci-cassandra Jenkins runs fail on {{ip-10-0-5-5: Name or service not known}}. CASSANDRA-15622 addressed some of these, but many still remain. Currently test C* nodes are either failing or listening on a public IP, depending on which agent they end up on.
> The idea behind this ticket is to make ant force the private VPC IP into the cassandra yaml when building; this will force the nodes to listen on the correct IP.
[jira] [Commented] (CASSANDRA-15901) Fix unit tests to load test/conf/cassandra.yaml (so to listen on a valid ip)
[ https://issues.apache.org/jira/browse/CASSANDRA-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151890#comment-17151890 ] Berenguer Blasi commented on CASSANDRA-15901: -
[~snazy] I accepted all your commits except a single one on wording. Please take a look at it, and at the {{JMXAuthTest}} failure, which is a hostname resolution error in some library. Wdyt?
[~mck] could you be so kind as to fire a run against cassandra13, please :-)? I am running a full CI on Circle as well. If all goes well, that should be it.
> Fix unit tests to load test/conf/cassandra.yaml (so to listen on a valid ip)
> -----------------------------------------------------------------------------
[jira] [Commented] (CASSANDRA-15901) Fix unit tests to load test/conf/cassandra.yaml (so to listen on a valid ip)
[ https://issues.apache.org/jira/browse/CASSANDRA-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151874#comment-17151874 ] Berenguer Blasi commented on CASSANDRA-15901: -
So it seems, after all the back and forth, and given the current restrictions, mainly:
* not providing a synthetic address load, which the test doesn't actually need
* avoiding failing on misconfigured nodes
adding a third fallback seems reasonable enough, moving from failing scenarios to a 'localhost' listen in a major release. Both {{getLocalHost()}} and {{getLoopbackAddress()}} may fail under some OS/IP configurations, so in any case we're no worse off than we used to be.
The change has been pushed. It's undergoing review, and an initial test run has been [fired|https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch-test/161/]. IIRC at least the JMXAuth test is a legit failure I need to look into. Once that's done I'll run a full test suite, as this is touching C* code: dtests, unit tests, jvm, etc.
> Fix unit tests to load test/conf/cassandra.yaml (so to listen on a valid ip)
> -----------------------------------------------------------------------------
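A rough sketch of the fallback chain discussed in the comment above. The helper class, method name, and the configured-address parameter are hypothetical, introduced only to illustrate the ordering (configured address, then {{getLocalHost()}}, then loopback as the third fallback); this is not the actual patch:
{code}
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper illustrating the discussed fallback chain.
final class ListenAddressResolver
{
    static InetAddress resolve(String configuredAddress)
    {
        // 1. address from test/conf/cassandra.yaml, if resolvable
        if (configuredAddress != null)
        {
            try
            {
                return InetAddress.getByName(configuredAddress);
            }
            catch (UnknownHostException e)
            {
                // fall through to the next option
            }
        }
        // 2. the host's own name; may fail under some OS/IP configurations
        try
        {
            return InetAddress.getLocalHost();
        }
        catch (UnknownHostException e)
        {
            // 3. third fallback: listen on localhost
            return InetAddress.getLoopbackAddress();
        }
    }
}
{code}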
[jira] [Commented] (CASSANDRA-15406) Add command to show the progress of data streaming and index build
[ https://issues.apache.org/jira/browse/CASSANDRA-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151873#comment-17151873 ] Benjamin Lerer commented on CASSANDRA-15406: -
[~stefan.miklosovic] Thanks. New Jenkins [run|https://ci-cassandra.apache.org/job/Cassandra-devbranch/195/]
> Add command to show the progress of data streaming and index build
> -------------------------------------------------------------------
> Key: CASSANDRA-15406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15406
> Project: Cassandra
> Issue Type: Improvement
> Components: Consistency/Streaming, Legacy/Streaming and Messaging, Tool/nodetool
> Reporter: maxwellguo
> Assignee: Stefan Miklosovic
> Priority: Normal
> Fix For: 4.0, 4.x
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We should supply a command to show the progress of streaming during bootstrap/move/decommission/removenode operations. While data streaming is underway, nobody knows which step the program is in, so a command to show the joining/leaving node's progress is needed.
>
> PR [https://github.com/apache/cassandra/pull/558]