[jira] [Updated] (CASSANDRA-9739) Migrate counter-cache to be fully off-heap
[ https://issues.apache.org/jira/browse/CASSANDRA-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Stupp updated CASSANDRA-9739: Resolution: Won't Fix Status: Resolved (was: Open) > Migrate counter-cache to be fully off-heap > -- > > Key: CASSANDRA-9739 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9739 > Project: Cassandra > Issue Type: Sub-task > Components: Legacy/Core >Reporter: Robert Stupp >Assignee: Robert Stupp >Priority: Normal > Fix For: 4.x > > > Counter cache still uses a concurrent map on-heap. This could move off-heap > and feels doable now after CASSANDRA-8099. > Evaluation should be done in advance, based on a POC, to prove that a pure > off-heap counter cache buys a performance and/or GC-pressure improvement. > In theory, eliminating on-heap management of the map should buy us some > benefit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15803) Separate out allow filtering scanning through a partition versus scanning over the table
[ https://issues.apache.org/jira/browse/CASSANDRA-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Hanna updated CASSANDRA-15803: - Description: Currently allow filtering can mean two things in the spirit of "avoid operations that don't seek to a specific row or sequential rows of data." First, it can mean scanning across the entire table to meet the criteria of the query. That's almost always a bad thing and should be discouraged or disabled (see CASSANDRA-8303). Second, it can mean filtering within a specific partition. For example, in a query you could specify the full partition key, and if you specify a criterion on a non-key field, it requires allow filtering. The second case that requires allow filtering involves significantly less work, since it only scans through a single partition. It is still extra work over seeking to a specific row and getting N sequential rows, though. So while an application developer and/or operator needs to be cautious about this second type, it's not necessarily a bad thing, depending on the table and the use case. I propose that we separate the way to specify allow filtering across an entire table from specifying allow filtering across a partition in a backwards compatible way. One idea that was brought up in Slack in the cassandra-dev room was to have allow filtering mean the superset - scanning across the table. Then if you want to specify that you *only* want to scan within a partition you would use something like {{ALLOW FILTERING [WITHIN PARTITION]}}. It will then succeed if you specify non-key criteria within a single partition, but fail with a message saying it requires the full allow filtering. This would allow for a backwards compatible full allow filtering while allowing a user to specify that they want to just scan within a partition, but error out if trying to scan a full table. This is potentially also related to the capability limitation framework by which operators could more granularly specify what features are allowed or disallowed per user, discussed in CASSANDRA-8303. This way an operator could disallow the more general allow filtering while allowing the partition scan (or disallow them both at their discretion). was: Currently allow filtering can mean two things in the spirit of "avoid operations that don't seek to a specific row or sequential rows of data." First, it can mean scanning across the entire table to meet the criteria of the query. That's almost always a bad thing and should be discouraged or disabled (see CASSANDRA-8303). Second, it can mean filtering within a specific partition. For example, in a query you could specify the full partition key and if you specify a criterion on a non-key field, it requires allow filtering. The second reason to require allow filtering is significantly less work to scan through a partition. It is still extra work over seeking to a specific row and getting N sequential rows though. So while an application developer and/or operator needs to be cautious about this second type, it's not necessarily a bad thing, depending on the table and the use case. I propose that we separate the way to specify allow filtering across an entire table (involving a scatter gather) from specifying allow filtering across a partition in a backwards compatible way. One idea that was brought up in Slack in the cassandra-dev room was to have allow filtering mean the superset - scanning across the table.
Then if you want to specify that you *only* want to scan within a partition you would use something like {{ALLOW FILTERING [WITHIN PARTITION]}} So it will succeed if you specify non-key criteria within a single partition, but fail with a message to say it requires the full allow filtering. This would allow for a backwards compatible full allow filtering while allowing a user to specify that they want to just scan within a partition, but error out if trying to scan a full table. This is potentially also related to the capability limitation framework by which operators could more granularly specify what features are allowed or disallowed per user, discussed in CASSANDRA-8303. This way an operator could disallow the more general allow filtering while allowing the partition scan (or disallow them both at their discretion). > Separate out allow filtering scanning through a partition versus scanning > over the table > > > Key: CASSANDRA-15803 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15803 > Project: Cassandra > Issue Type: Improvement > Components: CQL/Syntax >Reporter: Jeremy Hanna >Priority: Normal > > Currently allow filtering can mean two things in the spirit of "avoid > operations that don't seek to a specific row or sequ
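To make the two cases concrete, here is an illustrative CQL sketch (the schema is hypothetical, and the {{WITHIN PARTITION}} syntax is only the proposal above, not an implemented feature):

{code:sql}
-- Hypothetical table: partition key sensor_id, clustering key reading_time.
CREATE TABLE sensor_readings (
    sensor_id int,
    reading_time timestamp,
    value double,
    PRIMARY KEY (sensor_id, reading_time)
);

-- Case 1: no partition key restriction, so the whole table is scanned (scatter-gather).
SELECT * FROM sensor_readings WHERE value > 10 ALLOW FILTERING;

-- Case 2: partition key specified, so filtering is confined to one partition.
SELECT * FROM sensor_readings WHERE sensor_id = 42 AND value > 10 ALLOW FILTERING;

-- Proposed syntax: would succeed for case 2, but error out for case 1.
SELECT * FROM sensor_readings WHERE sensor_id = 42 AND value > 10 ALLOW FILTERING WITHIN PARTITION;
{code}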
[jira] [Comment Edited] (CASSANDRA-13701) Lower default num_tokens
[ https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152460#comment-17152460 ] Jeremy Hanna edited comment on CASSANDRA-13701 at 7/7/20, 3:28 AM: --- Can we also standardize the tests to use the default values - that is, from 32 to the new defaults (16 {{num_tokens}} with {{allocate_tokens_for_local_replication_factor=3}} uncommented). was (Author: jeromatron): Can we also standardize the tests to use the default values - that is, from 32 to the new defaults (16 {{num_tokens}} with {{allocate_tokens_for_local_replication_factor=3}} uncommented. > Lower default num_tokens > > > Key: CASSANDRA-13701 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13701 > Project: Cassandra > Issue Type: Improvement > Components: Local/Config >Reporter: Chris Lohfink >Assignee: Jeremy Hanna >Priority: Low > > For reasons highlighted in CASSANDRA-7032, the high number of vnodes is not > necessary. It is very expensive for operations processes and scanning. It's > come up a lot, and it's now standard and well known within the community to > always reduce num_tokens. We should just lower the defaults. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
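For reference, the defaults under discussion would look roughly like this in cassandra.yaml (a sketch based on the comment above; the surrounding comments in the shipped file may differ):

{code}
# Lowered from the long-standing default of 256 (tests had been using 32)
num_tokens: 16

# Uncommented so token allocation is optimized for the local replication factor
allocate_tokens_for_local_replication_factor: 3
{code}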
[jira] [Commented] (CASSANDRA-13701) Lower default num_tokens
[ https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152460#comment-17152460 ] Jeremy Hanna commented on CASSANDRA-13701: -- Can we also standardize the tests to use the default values - that is, from 32 to the new defaults (16 {{num_tokens}} with {{allocate_tokens_for_local_replication_factor=3}} uncommented. > Lower default num_tokens > > > Key: CASSANDRA-13701 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13701 > Project: Cassandra > Issue Type: Improvement > Components: Local/Config >Reporter: Chris Lohfink >Assignee: Jeremy Hanna >Priority: Low > > For reasons highlighted in CASSANDRA-7032, the high number of vnodes is not > necessary. It is very expensive for operations processes and scanning. It's > come up a lot, and it's now standard and well known within the community to > always reduce num_tokens. We should just lower the defaults. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-13701) Lower default num_tokens
[ https://issues.apache.org/jira/browse/CASSANDRA-13701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Hanna updated CASSANDRA-13701: - Test and Documentation Plan: Associated documentation about num_tokens is in [https://cassandra.apache.org/doc/latest/getting_started/production.html#tokens] as part of CASSANDRA-15618 as well as upgrading information in NEWS.txt. Status: Patch Available (was: In Progress) Pull request: https://github.com/apache/cassandra/pull/663 > Lower default num_tokens > > > Key: CASSANDRA-13701 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13701 > Project: Cassandra > Issue Type: Improvement > Components: Local/Config >Reporter: Chris Lohfink >Assignee: Jeremy Hanna >Priority: Low > > For reasons highlighted in CASSANDRA-7032, the high number of vnodes is not > necessary. It is very expensive for operations processes and scanning. It's > come up a lot, and it's now standard and well known within the community to > always reduce num_tokens. We should just lower the defaults. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15894) JAVA 8: test_multiple_repair - repair_tests.incremental_repair_test.TestIncRepair
[ https://issues.apache.org/jira/browse/CASSANDRA-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152418#comment-17152418 ] Ekaterina Dimitrova edited comment on CASSANDRA-15894 at 7/7/20, 2:04 AM: -- I think REPAIR was improved recently, as I saw some related tickets closed. Now I managed to run this test successfully 100 times on the latest trunk (when I opened this ticket, it was consistently failing on my computer). But I observed in the log (attached to this ticket) that from time to time the test succeeds only after a second run; the first time it times out again. [~blerer], as the one assigned to apply expertise to CASSANDRA-15580, what is your advice: should we look further into this ticket/particular test at this moment? Thank you in advance :) PS: I just tried to find out the reasoning behind the setup of running the test two times before it is considered failed. Unfortunately, these tests were at some point moved from a different script to incremental_repair_test.py, and some of the git history was lost when the old script was deleted. was (Author: e.dimitrova): I think REPAIR was improved recently as I think I saw some related tickets closed. Now I managed to run this test successfully 100 times on the latest trunk. (when I opened this ticket, It was consistently failing on my computer). But I observed in the log (Attached to this ticket) that from time to time the test succeeds only after second run. The first time it times out again. [~blerer] , as the one assigned to apply expertise to CASSANDRA-15580, what is your advice, should we check further this ticket/particular test at this moment? Thank you in advance :) PS I just tried to find out the reasoning behind the setup of running the test two times before it is considered failed. Unfortunately, back in time these tests were moved from a different script to incremental_repair_test.py and at that point some of the git historical data is lost as. the old script was deleted. > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > - > > Key: CASSANDRA-15894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15894 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Berenguer Blasi >Priority: Normal > Fix For: 4.0-rc > > Attachments: test_multiple_repair_log.txt > > > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > Fails locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/46515d14-9be4-4edb-8db4-5930312d2bfb/jobs/1329 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15894) JAVA 8: test_multiple_repair - repair_tests.incremental_repair_test.TestIncRepair
[ https://issues.apache.org/jira/browse/CASSANDRA-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152418#comment-17152418 ] Ekaterina Dimitrova edited comment on CASSANDRA-15894 at 7/7/20, 2:03 AM: -- I think REPAIR was improved recently as I think I saw some related tickets closed. Now I managed to run this test successfully 100 times on the latest trunk. (when I opened this ticket, It was consistently failing on my computer). But I observed in the log (Attached to this ticket) that from time to time the test succeeds only after second run. The first time it times out again. [~blerer] , as the one assigned to apply expertise to CASSANDRA-15580, what is your advice, should we check further this ticket/particular test at this moment? Thank you in advance :) PS I just tried to find out the reasoning behind the setup of running the test two times before it is considered failed. Unfortunately, back in time these tests were moved from a different script to incremental_repair_test.py and at that point some of the git historical data is lost as. the old script was deleted. was (Author: e.dimitrova): I think REPAIR was improved recently as I think I saw some related tickets closed. Now I managed to run this test successfully 100 times. (when I opened this ticket, It was consistently failing on my computer). But I observed in the log (Attached to this ticket) that from time to time the test succeeds only after second run. The first time it times out again. [~blerer] , as the one assigned to apply expertise to CASSANDRA-15580, what is your advice, should we check further this ticket/particular test at this moment? Thank you in advance :) PS I just tried to find out the reasoning behind the setup of running the test two times before it is considered failed. Unfortunately, back in time these tests were moved from a different script to incremental_repair_test.py and at that point some of the git historical data is lost as. the old script was deleted. > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > - > > Key: CASSANDRA-15894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15894 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Berenguer Blasi >Priority: Normal > Fix For: 4.0-rc > > Attachments: test_multiple_repair_log.txt > > > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > Fails locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/46515d14-9be4-4edb-8db4-5930312d2bfb/jobs/1329 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15894) JAVA 8: test_multiple_repair - repair_tests.incremental_repair_test.TestIncRepair
[ https://issues.apache.org/jira/browse/CASSANDRA-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ekaterina Dimitrova updated CASSANDRA-15894: Discovered By: User Report (was: Unit Test) > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > - > > Key: CASSANDRA-15894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15894 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Berenguer Blasi >Priority: Normal > Fix For: 4.0-rc > > Attachments: test_multiple_repair_log.txt > > > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > Fails locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/46515d14-9be4-4edb-8db4-5930312d2bfb/jobs/1329 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15894) JAVA 8: test_multiple_repair - repair_tests.incremental_repair_test.TestIncRepair
[ https://issues.apache.org/jira/browse/CASSANDRA-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ekaterina Dimitrova updated CASSANDRA-15894: Attachment: test_multiple_repair_log.txt > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > - > > Key: CASSANDRA-15894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15894 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Berenguer Blasi >Priority: Normal > Fix For: 4.0-rc > > Attachments: test_multiple_repair_log.txt > > > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > Fails locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/46515d14-9be4-4edb-8db4-5930312d2bfb/jobs/1329 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15894) JAVA 8: test_multiple_repair - repair_tests.incremental_repair_test.TestIncRepair
[ https://issues.apache.org/jira/browse/CASSANDRA-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152418#comment-17152418 ] Ekaterina Dimitrova edited comment on CASSANDRA-15894 at 7/7/20, 1:59 AM: -- I think REPAIR was improved recently as I think I saw some related tickets closed. Now I managed to run this test successfully 100 times. (when I opened this ticket, It was consistently failing on my computer). But I observed in the log (Attached to this ticket) that from time to time the test succeeds only after second run. The first time it times out again. [~blerer] , as the one assigned to apply expertise to CASSANDRA-15580, what is your advice, should we check further this ticket/particular test at this moment? Thank you in advance :) PS I just tried to find out the reasoning behind the setup of running the test two times before it is considered failed. Unfortunately, back in time these tests were moved from a different script to incremental_repair_test.py and at that point some of the git historical data is lost as. the old script was deleted. was (Author: e.dimitrova): I think REPAIR was improved recently as I think I saw some related tickets closed. Now I managed to run this test successfully 100 times. (when I opened this ticket, It was consistently failing on my computer). But I observed in the log (Attached to this ticket) that from time to time the test succeeds only after second run. The first time it times out again. [~blerer] , as the one assigned to apply expertise to CASSANDRA-15580, what is your advice, should we check further this ticket/particular test at this moment? Thank you in advance :) PS I just tried to find out the reasoning behind the setup of running the test two times before it is considered failed. Unfortunately, back in time these tests were moved from a different script to incremental_repair_test.py and at that point some of the git historical data is lost. > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > - > > Key: CASSANDRA-15894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15894 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Berenguer Blasi >Priority: Normal > Fix For: 4.0-rc > > > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > Fails locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/46515d14-9be4-4edb-8db4-5930312d2bfb/jobs/1329 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15894) JAVA 8: test_multiple_repair - repair_tests.incremental_repair_test.TestIncRepair
[ https://issues.apache.org/jira/browse/CASSANDRA-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152418#comment-17152418 ] Ekaterina Dimitrova edited comment on CASSANDRA-15894 at 7/7/20, 1:59 AM: -- I think REPAIR was improved recently as I think I saw some related tickets closed. Now I managed to run this test successfully 100 times. (when I opened this ticket, It was consistently failing on my computer). But I observed in the log (Attached to this ticket) that from time to time the test succeeds only after second run. The first time it times out again. [~blerer] , as the one assigned to apply expertise to CASSANDRA-15580, what is your advice, should we check further this ticket/particular test at this moment? Thank you in advance :) PS I just tried to find out the reasoning behind the setup of running the test two times before it is considered failed. Unfortunately, back in time these tests were moved from a different script to incremental_repair_test.py and at that point some of the git historical data is lost. was (Author: e.dimitrova): I think REPAIR was improved recently as I think I saw some related tickets closed. Now I managed to run this test successfully 100 times. (when I opened this ticket, It was consistently failing on my computer). But I observed in the log (Attached to this ticket) that from time to time the test succeeds only after second run. The first time it times out again. [~blerer] , as the one assigned to apply expertise to CASSANDRA-15580, what is your advice, should we check further this ticket/particular test at this moment? Thank you in advance :) PS I just tried to find out the reasoning between the setup of running the test two times before it is considered failed. Unfortunately, back in time these tests were moved from a different script to incremental_repair_test.py and at that point some of the git historical data is lost. > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > - > > Key: CASSANDRA-15894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15894 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Berenguer Blasi >Priority: Normal > Fix For: 4.0-rc > > > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > Fails locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/46515d14-9be4-4edb-8db4-5930312d2bfb/jobs/1329 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15894) JAVA 8: test_multiple_repair - repair_tests.incremental_repair_test.TestIncRepair
[ https://issues.apache.org/jira/browse/CASSANDRA-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152418#comment-17152418 ] Ekaterina Dimitrova commented on CASSANDRA-15894: - I think REPAIR was improved recently as I think I saw some related tickets closed. Now I managed to run this test successfully 100 times. (when I opened this ticket, It was consistently failing on my computer). But I observed in the log (Attached to this ticket) that from time to time the test succeeds only after second run. The first time it times out again. [~blerer] , as the one assigned to apply expertise to CASSANDRA-15580, what is your advice, should we check further this ticket/particular test at this moment? Thank you in advance :) PS I just tried to find out the reasoning between the setup of running the test two times before it is considered failed. Unfortunately, back in time these tests were moved from a different script to incremental_repair_test.py and at that point some of the git historical data is lost. > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > - > > Key: CASSANDRA-15894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15894 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Berenguer Blasi >Priority: Normal > Fix For: 4.0-rc > > > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > Fails locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/46515d14-9be4-4edb-8db4-5930312d2bfb/jobs/1329 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15592) IllegalStateException in gossip after removing node
[ https://issues.apache.org/jira/browse/CASSANDRA-15592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152387#comment-17152387 ] Jai Bheemsen Rao Dhanwada commented on CASSANDRA-15592: --- Hello [~brandon.williams], I ran into a similar exception. Is there any impact from this ERROR, or is it just more of a logging problem? In my tests I didn't see any impact to the cluster operations, so I would like to know the impact of this before attempting to upgrade in production. > IllegalStateException in gossip after removing node > --- > > Key: CASSANDRA-15592 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15592 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Gossip >Reporter: Marcus Olsson >Assignee: Marcus Olsson >Priority: Normal > Fix For: 3.0.21, 3.11.7, 4.0, 4.0-alpha4 > > > In one of our test environments we encountered the following exception: > {noformat} > 2020-02-02T10:50:13.276+0100 [GossipTasks:1] ERROR > o.a.c.u.NoSpamLogger$NoSpamLogStatement:97 log > java.lang.IllegalStateException: Attempting gossip state mutation from > illegal thread: GossipTasks:1 > at > org.apache.cassandra.gms.Gossiper.checkProperThreadForStateMutation(Gossiper.java:178) > at org.apache.cassandra.gms.Gossiper.evictFromMembership(Gossiper.java:465) > at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:895) > at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:78) > at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:240) > at > org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at > org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.lang.Thread.run(Thread.java:748) > java.lang.IllegalStateException: Attempting gossip state mutation from > illegal thread: GossipTasks:1 > at > org.apache.cassandra.gms.Gossiper.checkProperThreadForStateMutation(Gossiper.java:178) > [apache-cassandra-3.11.5.jar:3.11.5] > at org.apache.cassandra.gms.Gossiper.evictFromMembership(Gossiper.java:465) > [apache-cassandra-3.11.5.jar:3.11.5] > at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:895) > [apache-cassandra-3.11.5.jar:3.11.5] > at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:78) > [apache-cassandra-3.11.5.jar:3.11.5] > at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:240) > [apache-cassandra-3.11.5.jar:3.11.5] > at > org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run(DebuggableScheduledThreadPoolExecutor.java:118) > [apache-cassandra-3.11.5.jar:3.11.5] > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > [na:1.8.0_231] > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > [na:1.8.0_231] > 
at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > [na:1.8.0_231] > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > [na:1.8.0_231] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [na:1.8.0_231] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [na:1.8.0_231] > at > org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) > [apache-cassandra-3.11.5.jar:3.11.5] > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > ~[netty-all-4.1.42.Final.jar:4.1.42.Final] > at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_231] > {noformat} > Since CASSANDRA-15059 we check that all state changes are performed in the > GossipStage but it seems like it was still performed in the "current" thread > [here|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/gms/Gossiper.java#L895]. > It should be as simp
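The quoted description is cut off, but it points at moving the mutation onto the gossip stage instead of the GossipTasks timer thread. Here is a self-contained Java sketch of that pattern; the executor and names below are illustrative only, not Cassandra's actual internals:

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class GossipStageSketch
{
    // A single-threaded executor standing in for Cassandra's gossip stage.
    private static final ExecutorService GOSSIP_STAGE =
        Executors.newSingleThreadExecutor(r -> new Thread(r, "GossipStage:1"));

    private static void checkProperThreadForStateMutation()
    {
        // Mirrors the check that produced the exception in the report above.
        if (!Thread.currentThread().getName().startsWith("GossipStage"))
            throw new IllegalStateException("Attempting gossip state mutation from illegal thread: "
                                            + Thread.currentThread().getName());
    }

    private static void evictFromMembership(String endpoint)
    {
        checkProperThreadForStateMutation(); // passes, because we are on the gossip stage
        System.out.println("evicted " + endpoint + " on " + Thread.currentThread().getName());
    }

    public static void main(String[] args)
    {
        // The status check runs on a timer thread; instead of calling
        // evictFromMembership directly (which would throw), submit it to the stage.
        GOSSIP_STAGE.execute(() -> evictFromMembership("10.0.0.1"));
        GOSSIP_STAGE.shutdown();
    }
}
{code}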
[jira] [Updated] (CASSANDRA-15900) Close channel and reduce buffer allocation during entire sstable streaming with SSL
[ https://issues.apache.org/jira/browse/CASSANDRA-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dinesh Joshi updated CASSANDRA-15900: - Since Version: 4.0-alpha1 Source Control Link: https://github.com/apache/cassandra/commit/73691944c0ff9b01679cf5a6fe5944ad4c416509 Resolution: Fixed Status: Resolved (was: Ready to Commit) Committed. Thanks, [~maedhroz] and [~jasonstack]! > Close channel and reduce buffer allocation during entire sstable streaming > with SSL > --- > > Key: CASSANDRA-15900 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15900 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: ZhaoYang >Assignee: ZhaoYang >Priority: Normal > Fix For: 4.0-beta > > > CASSANDRA-15740 added the ability to stream entire sstables by loading the > on-disk file into a user-space off-heap buffer when SSL is enabled, because > netty doesn't support zero-copy with SSL. > But there are two issues: > # the file channel is not closed. > # a 1mb batch size is used. 1mb exceeds the buffer pool's max allocation > size, thus it's all allocated outside the pool and will cause a large number > of allocations. > [Patch|https://github.com/apache/cassandra/pull/651]: > # close the file channel when the last batch is loaded into the off-heap > bytebuffer. I don't think we need to wait until the buffer is flushed by > netty. > # reduce the batch to 64kb, which is more buffer-pool friendly when > streaming entire sstables with SSL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15900) Close channel and reduce buffer allocation during entire sstable streaming with SSL
[ https://issues.apache.org/jira/browse/CASSANDRA-15900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dinesh Joshi updated CASSANDRA-15900: - Reviewers: Caleb Rackliffe, Dinesh Joshi (was: Caleb Rackliffe) > Close channel and reduce buffer allocation during entire sstable streaming > with SSL > --- > > Key: CASSANDRA-15900 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15900 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Streaming and Messaging >Reporter: ZhaoYang >Assignee: ZhaoYang >Priority: Normal > Fix For: 4.0-beta > > > CASSANDRA-15740 added the ability to stream entire sstables by loading the > on-disk file into a user-space off-heap buffer when SSL is enabled, because > netty doesn't support zero-copy with SSL. > But there are two issues: > # the file channel is not closed. > # a 1mb batch size is used. 1mb exceeds the buffer pool's max allocation > size, thus it's all allocated outside the pool and will cause a large number > of allocations. > [Patch|https://github.com/apache/cassandra/pull/651]: > # close the file channel when the last batch is loaded into the off-heap > bytebuffer. I don't think we need to wait until the buffer is flushed by > netty. > # reduce the batch to 64kb, which is more buffer-pool friendly when > streaming entire sstables with SSL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[cassandra] branch trunk updated: Close channel and reduce buffer allocation during entire sstable streaming with SSL
This is an automated email from the ASF dual-hosted git repository.

djoshi pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra.git

The following commit(s) were added to refs/heads/trunk by this push:
     new 7369194  Close channel and reduce buffer allocation during entire sstable streaming with SSL
7369194 is described below

commit 73691944c0ff9b01679cf5a6fe5944ad4c416509
Author: Zhao Yang
AuthorDate: Wed Jun 24 18:37:47 2020 +0800

    Close channel and reduce buffer allocation during entire sstable streaming with SSL

    Patch by Zhao Yang; Reviewed by Caleb Rackliffe and Dinesh Joshi for CASSANDRA-15900
---
 CHANGES.txt                                        |  1 +
 .../cassandra/net/AsyncStreamingOutputPlus.java    | 66 ++
 .../net/AsyncStreamingOutputPlusTest.java          | 58 +++
 .../unit/org/apache/cassandra/net/TestChannel.java |  4 +-
 4 files changed, 101 insertions(+), 28 deletions(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index 8fafb7d..c3fdf4f 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 4.0-alpha5
+ * Close channel and reduce buffer allocation during entire sstable streaming with SSL (CASSANDRA-15900)
 * Prune expired messages less frequently in internode messaging (CASSANDRA-15700)
 * Fix Ec2Snitch handling of legacy mode for dc names matching both formats, eg "us-west-2" (CASSANDRA-15878)
 * Add support for server side DESCRIBE statements (CASSANDRA-14825)
diff --git a/src/java/org/apache/cassandra/net/AsyncStreamingOutputPlus.java b/src/java/org/apache/cassandra/net/AsyncStreamingOutputPlus.java
index e685584..680a9d3 100644
--- a/src/java/org/apache/cassandra/net/AsyncStreamingOutputPlus.java
+++ b/src/java/org/apache/cassandra/net/AsyncStreamingOutputPlus.java
@@ -23,11 +23,13 @@ import java.nio.ByteBuffer;
 import java.nio.channels.ClosedChannelException;
 import java.nio.channels.FileChannel;

+import com.google.common.annotations.VisibleForTesting;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;

 import io.netty.channel.Channel;
 import io.netty.channel.ChannelPromise;
+import io.netty.channel.FileRegion;
 import io.netty.channel.WriteBufferWaterMark;
 import io.netty.handler.ssl.SslHandler;
 import org.apache.cassandra.io.compress.BufferType;
@@ -161,51 +163,65 @@ public class AsyncStreamingOutputPlus extends AsyncChannelOutputPlus
     }

     /**
+     * Writes all data in file channel to stream:
+     * * For zero-copy-streaming, 1MiB at a time, with at most 2MiB in flight at once.
+     * * For streaming with SSL, 64kb at a time, with at most 32+64kb (default low water mark + batch size) in flight.
      *
-     * Writes all data in file channel to stream, 1MiB at a time, with at most 2MiB in flight at once.
-     * This method takes ownership of the provided {@code FileChannel}.
+     * This method takes ownership of the provided {@link FileChannel}.
      *
      * WARNING: this method blocks only for permission to write to the netty channel; it exits before
-     * the write is flushed to the network.
+     * the {@link FileRegion}(zero-copy) or {@link ByteBuffer}(ssl) is flushed to the network.
      */
     public long writeFileToChannel(FileChannel file, StreamRateLimiter limiter) throws IOException
     {
-        // write files in 1MiB chunks, since there may be blocking work performed to fetch it from disk,
-        // the data is never brought in process and is gated by the wire anyway
         if (channel.pipeline().get(SslHandler.class) != null)
-            return writeFileToChannel(file, limiter, 1 << 20, 1 << 20, 2 << 20);
+            // each batch is loaded into ByteBuffer, 64kb is more BufferPool friendly.
+            return writeFileToChannel(file, limiter, 1 << 16);
         else
+            // write files in 1MiB chunks, since there may be blocking work performed to fetch it from disk,
+            // the data is never brought in process and is gated by the wire anyway
            return writeFileToChannelZeroCopy(file, limiter, 1 << 20, 1 << 20, 2 << 20);
     }

-    public long writeFileToChannel(FileChannel fc, StreamRateLimiter limiter, int batchSize, int lowWaterMark, int highWaterMark) throws IOException
+    @VisibleForTesting
+    long writeFileToChannel(FileChannel fc, StreamRateLimiter limiter, int batchSize) throws IOException
     {
         final long length = fc.size();
         long bytesTransferred = 0;
-        while (bytesTransferred < length)
+
+        try
+        {
+            while (bytesTransferred < length)
+            {
+                int toWrite = (int) min(batchSize, length - bytesTransferred);
+                final long position = bytesTransferred;
+
+                writeToChannel(bufferSupplier -> {
+                    ByteBuffer outBuffer = bufferSupplier.get(toWrite);
+                    long read = fc.read(outBuffer, position);
+                    if (read != toWri
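The diff above is cut off mid-hunk in the archive. As a self-contained illustration of the batching pattern it implements (reading a file in fixed 64 KiB chunks so each buffer stays within a pool-friendly allocation size, and closing the channel when the last batch is read), here is an editor's sketch in plain NIO; it is not the actual Cassandra code, which hands each buffer to netty and applies rate limiting:

{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChunkedFileRead
{
    static final int BATCH_SIZE = 1 << 16; // 64 KiB, the SSL-path batch size

    public static long transfer(String path) throws IOException
    {
        long transferred = 0;
        try (FileChannel fc = FileChannel.open(Paths.get(path), StandardOpenOption.READ))
        {
            long length = fc.size();
            while (transferred < length)
            {
                int toRead = (int) Math.min(BATCH_SIZE, length - transferred);
                ByteBuffer buffer = ByteBuffer.allocateDirect(toRead);
                long read = fc.read(buffer, transferred);
                // the real code also treats a short read as an error
                if (read != toRead)
                    throw new IOException("could not read " + toRead + " bytes at " + transferred);
                buffer.flip();
                // ... hand the buffer to the network layer here ...
                transferred += read;
            }
        } // try-with-resources closes the channel, the first bug fixed by the patch
        return transferred;
    }
}
{code}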
[jira] [Comment Edited] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest
[ https://issues.apache.org/jira/browse/CASSANDRA-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152379#comment-17152379 ] Ekaterina Dimitrova edited comment on CASSANDRA-15685 at 7/7/20, 12:10 AM: --- Back to this work. [~blerer], [~bdeggleston], [~marcuse], may I ask for your expert advice? Is fixing the test enough here, or should [this behavior | https://issues.apache.org/jira/browse/CASSANDRA-15685?focusedCommentId=17121396&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121396] also be considered? Thank you in advance :) was (Author: e.dimitrova): Back to this work. [~blerer], [~bdeggleston], [~marcuse], may I ask for your expert advice? Is fixing the test enough here or this behavior should also be considered? Thank you in advance :) > flaky testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > -- > > Key: CASSANDRA-15685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15685 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Assignee: Ekaterina Dimitrova >Priority: Normal > Labels: pull-request-available > Fix For: 4.0-beta > > Attachments: log-CASSANDRA-15685.txt, output > > Time Spent: 10m > Remaining Estimate: 0h > > Observed in: > https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests > Failure: > {noformat} > testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97) > {noformat} > [Circle > CI|https://circleci.com/gh/dcapwell/cassandra/tree/bug%2FCASSANDRA-15685] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15685) flaky testWithMismatchingPending - org.apache.cassandra.distributed.test.PreviewRepairTest
[ https://issues.apache.org/jira/browse/CASSANDRA-15685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152379#comment-17152379 ] Ekaterina Dimitrova commented on CASSANDRA-15685: - Back to this work. [~blerer], [~bdeggleston], [~marcuse], may I ask for your expert advice? Is fixing the test enough here or this behavior should also be considered? Thank you in advance :) > flaky testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > -- > > Key: CASSANDRA-15685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15685 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Kevin Gallardo >Assignee: Ekaterina Dimitrova >Priority: Normal > Labels: pull-request-available > Fix For: 4.0-beta > > Attachments: log-CASSANDRA-15685.txt, output > > Time Spent: 10m > Remaining Estimate: 0h > > Observed in: > https://app.circleci.com/pipelines/github/newkek/cassandra/34/workflows/1c6b157d-13c3-48a9-85fb-9fe8c153256b/jobs/191/tests > Failure: > {noformat} > testWithMismatchingPending - > org.apache.cassandra.distributed.test.PreviewRepairTest > junit.framework.AssertionFailedError > at > org.apache.cassandra.distributed.test.PreviewRepairTest.testWithMismatchingPending(PreviewRepairTest.java:97) > {noformat} > [Circle > CI|https://circleci.com/gh/dcapwell/cassandra/tree/bug%2FCASSANDRA-15685] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Issue Comment Deleted] (CASSANDRA-8675) COPY TO/FROM broken for newline characters
[ https://issues.apache.org/jira/browse/CASSANDRA-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jai Bheemsen Rao Dhanwada updated CASSANDRA-8675: - Comment: was deleted (was: I tried the patch, but still running into the issue where if I look at the data with cqlsh I see a yellow '\n' after the import (literal) instead of purple '\n' (control character) ) > COPY TO/FROM broken for newline characters > -- > > Key: CASSANDRA-8675 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8675 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Tools > Environment: [cqlsh 5.0.1 | Cassandra 2.1.2 | CQL spec 3.2.0 | Native > protocol v3] > Ubuntu 14.04 64-bit >Reporter: Lex Lythius >Priority: Normal > Labels: cqlsh, remove-reopen > Fix For: 3.0.x > > Attachments: CASSANDRA-8675.patch, copytest.csv > > > Exporting/importing does not preserve contents when texts containing newline > (and possibly other) characters are involved: > {code:sql} > cqlsh:test> create table if not exists copytest (id int primary key, t text); > cqlsh:test> insert into copytest (id, t) values (1, 'This has a newline > ... character'); > cqlsh:test> insert into copytest (id, t) values (2, 'This has a quote " > character'); > cqlsh:test> insert into copytest (id, t) values (3, 'This has a fake tab \t > character (typed backslash, t)'); > cqlsh:test> select * from copytest; > id | t > +- > 1 | This has a newline\ncharacter > 2 |This has a quote " character > 3 | This has a fake tab \t character (entered slash-t text) > (3 rows) > cqlsh:test> copy copytest to '/tmp/copytest.csv'; > 3 rows exported in 0.034 seconds. > cqlsh:test> copy copytest from '/tmp/copytest.csv'; > 3 rows imported in 0.005 seconds. > cqlsh:test> select * from copytest; > id | t > +--- > 1 | This has a newlinencharacter > 2 | This has a quote " character > 3 | This has a fake tab \t character (typed backslash, t) > (3 rows) > {code} > I tried replacing \n in the CSV file with \\n, which just expands to \n in > the table; and with an actual newline character, which fails with error since > it prematurely terminates the record. > It seems backslashes are only used to take the following character as a > literal > Until this is fixed, what would be the best way to refactor an old table with > a new, incompatible structure maintaining its content and name, since we > can't rename tables? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15894) JAVA 8: test_multiple_repair - repair_tests.incremental_repair_test.TestIncRepair
[ https://issues.apache.org/jira/browse/CASSANDRA-15894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ekaterina Dimitrova updated CASSANDRA-15894: Status: In Progress (was: Patch Available) > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > - > > Key: CASSANDRA-15894 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15894 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Berenguer Blasi >Priority: Normal > Fix For: 4.0-rc > > > JAVA 8: test_multiple_repair - > repair_tests.incremental_repair_test.TestIncRepair > Fails locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/46515d14-9be4-4edb-8db4-5930312d2bfb/jobs/1329 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15893) JAVA 11: test_short_read - consistency_test.TestConsistency
[ https://issues.apache.org/jira/browse/CASSANDRA-15893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152367#comment-17152367 ] Ekaterina Dimitrova edited comment on CASSANDRA-15893 at 7/6/20, 11:29 PM: --- Thank you for your time and work [~Bereng]. I know how frustrating this can be, but not being able to reproduce test failures easily happens sometimes. I am already looking into it, and I am moving the ticket back to In Progress. Thank you one more time! was (Author: e.dimitrova): Thank you for your time and work [~Bereng], I know how frustrating this could be but not being able to reproduce test failures easy happens sometimes. I am already looking into it, I am moving the ticket back to work in progress. Thank you one more time! > JAVA 11: test_short_read - consistency_test.TestConsistency > --- > > Key: CASSANDRA-15893 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15893 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-rc > > > JAVA 11: test_short_read - consistency_test.TestConsistency > Failing locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/11202c7e-6c94-4d4e-bbbf-9e2fa9791ad0/jobs/1337 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15893) JAVA 11: test_short_read - consistency_test.TestConsistency
[ https://issues.apache.org/jira/browse/CASSANDRA-15893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152367#comment-17152367 ] Ekaterina Dimitrova edited comment on CASSANDRA-15893 at 7/6/20, 11:26 PM: --- Thank you for your time and work [~Bereng], I know how frustrating this could be but not being able to reproduce test failures easy happens sometimes. I am already looking into it, I am moving the ticket back to work in progress. Thank you one more time! was (Author: e.dimitrova): Thank you for your time and work [~Bereng], I know how frustrating this could be but it happens sometimes. I am already looking into it, I am moving it back to work in progress. Thank you one more time! > JAVA 11: test_short_read - consistency_test.TestConsistency > --- > > Key: CASSANDRA-15893 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15893 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-rc > > > JAVA 11: test_short_read - consistency_test.TestConsistency > Failing locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/11202c7e-6c94-4d4e-bbbf-9e2fa9791ad0/jobs/1337 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15893) JAVA 11: test_short_read - consistency_test.TestConsistency
[ https://issues.apache.org/jira/browse/CASSANDRA-15893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152367#comment-17152367 ] Ekaterina Dimitrova commented on CASSANDRA-15893: - Thank you for your time and work [~Bereng], I know how frustrating this could be but it happens sometimes. I am already looking into it, I am moving it back to work in progress. Thank you one more time! > JAVA 11: test_short_read - consistency_test.TestConsistency > --- > > Key: CASSANDRA-15893 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15893 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-rc > > > JAVA 11: test_short_read - consistency_test.TestConsistency > Failing locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/11202c7e-6c94-4d4e-bbbf-9e2fa9791ad0/jobs/1337 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15893) JAVA 11: test_short_read - consistency_test.TestConsistency
[ https://issues.apache.org/jira/browse/CASSANDRA-15893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ekaterina Dimitrova updated CASSANDRA-15893: Status: In Progress (was: Patch Available) > JAVA 11: test_short_read - consistency_test.TestConsistency > --- > > Key: CASSANDRA-15893 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15893 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest >Reporter: Ekaterina Dimitrova >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-rc > > > JAVA 11: test_short_read - consistency_test.TestConsistency > Failing locally and in CircleCI: > https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/222/workflows/11202c7e-6c94-4d4e-bbbf-9e2fa9791ad0/jobs/1337 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-8675) COPY TO/FROM broken for newline characters
[ https://issues.apache.org/jira/browse/CASSANDRA-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152366#comment-17152366 ] Jai Bheemsen Rao Dhanwada commented on CASSANDRA-8675: -- I tried the patch, but still running into the issue where if I look at the data with cqlsh I see a yellow '\n' after the import (literal) instead of purple '\n' (control character) > COPY TO/FROM broken for newline characters > -- > > Key: CASSANDRA-8675 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8675 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Tools > Environment: [cqlsh 5.0.1 | Cassandra 2.1.2 | CQL spec 3.2.0 | Native > protocol v3] > Ubuntu 14.04 64-bit >Reporter: Lex Lythius >Priority: Normal > Labels: cqlsh, remove-reopen > Fix For: 3.0.x > > Attachments: CASSANDRA-8675.patch, copytest.csv > > > Exporting/importing does not preserve contents when texts containing newline > (and possibly other) characters are involved: > {code:sql} > cqlsh:test> create table if not exists copytest (id int primary key, t text); > cqlsh:test> insert into copytest (id, t) values (1, 'This has a newline > ... character'); > cqlsh:test> insert into copytest (id, t) values (2, 'This has a quote " > character'); > cqlsh:test> insert into copytest (id, t) values (3, 'This has a fake tab \t > character (typed backslash, t)'); > cqlsh:test> select * from copytest; > id | t > +- > 1 | This has a newline\ncharacter > 2 |This has a quote " character > 3 | This has a fake tab \t character (entered slash-t text) > (3 rows) > cqlsh:test> copy copytest to '/tmp/copytest.csv'; > 3 rows exported in 0.034 seconds. > cqlsh:test> copy copytest from '/tmp/copytest.csv'; > 3 rows imported in 0.005 seconds. > cqlsh:test> select * from copytest; > id | t > +--- > 1 | This has a newlinencharacter > 2 | This has a quote " character > 3 | This has a fake tab \t character (typed backslash, t) > (3 rows) > {code} > I tried replacing \n in the CSV file with \\n, which just expands to \n in > the table; and with an actual newline character, which fails with error since > it prematurely terminates the record. > It seems backslashes are only used to take the following character as a > literal > Until this is fixed, what would be the best way to refactor an old table with > a new, incompatible structure maintaining its content and name, since we > can't rename tables? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Description: h4. Problem The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} for the current offset in the region. Allocations depend on a {{.compareAndSet(..)}} call. In highly contended environments the CAS failures can be high, starving writes in a running Cassandra node. h4. Example Up to 33% of CPU time has been witnessed stuck in the {{NativeAllocator.Region.allocate(..)}} loop (due to the CAS failures) during a heavy spark analytics write load. These nodes (40 CPU cores, 256GB RAM) have the relevant settings - {{memtable_allocation_type: offheap_objects}} - {{memtable_offheap_space_in_mb: 5120}} - {{concurrent_writes: 160}} Numerous flamegraphs demonstrate the problem. See attached [^profile_pbdpc23zafsrh_20200702.svg]. h4. Suggestion: ThreadLocal Regions One possible solution is to have separate Regions per thread. Code-wise this is relatively easy to do, for example replacing NativeAllocator:59 {code}private final AtomicReference<Region> currentRegion = new AtomicReference<>();{code} with {code}private final ThreadLocal<AtomicReference<Region>> currentRegion = new ThreadLocal<>() {...};{code} But this approach substantially changes the allocation behaviour, with more than concurrent_writes number of Regions in use at any one time. For example with {{concurrent_writes: 160}} that's 160+ regions, each of 1MB. h4. Suggestion: Simple Contention Management Algorithm (Constant Backoff) Another possible solution is to introduce a contention management algorithm that a) reduces CAS failures in high contention environments, b) doesn't impact normal environments, and c) keeps the allocation strategy of using one region at a time. The research paper [arXiv:1305.5800|https://arxiv.org/abs/1305.5800] describes this contention CAS problem and demonstrates a number of algorithms to apply. The simplest of these algorithms is the Constant Backoff CAS Algorithm. Applying the Constant Backoff CAS Algorithm involves adding one line of code to {{NativeAllocator.Region.allocate(..)}} to sleep for one nanosecond (or some constant number of nanoseconds) after a CAS failure occurs. That is... {code} // we raced and lost alloc, try again LockSupport.parkNanos(1); {code} h4. Constant Backoff CAS Algorithm Experiments Using the code attached in NativeAllocatorRegionTest.java the concurrency and CAS failures of {{NativeAllocator.Region.allocate(..)}} can be demonstrated. In the attached [^NativeAllocatorRegionTest.java] class, which can be run standalone, the {{Region}} class is copied from {{NativeAllocator.Region}} with a {{casFailures}} field added. The following two screenshots are from data collected from this class on a 6 CPU (12 core) MBP, running the {{NativeAllocatorRegionTest.testRegionCAS}} method. This attached screenshot shows the number of CAS failures during the life of a Region (over ~215 million allocations), using different threads and park times. This illustrates the improvement (reduction) of CAS failures from zero park time, through orders of magnitude, up to 1000ns (10ms). The biggest improvement is from no algorithm to a park time of 1ns where CAS failures are ~two orders of magnitude lower. From a park time of 10μs and higher there is a significant drop also at low contention rates. !Screen Shot 2020-07-05 at 13.16.10.png|width=500px! 
This attached screenshot shows the time it takes to fill a Region (~215 million allocations), using different threads and park times. The biggest improvement is from no algorithm to a park time of 1ns, where performance is one order of magnitude faster. From a park time of 100μs and higher there is an even further significant drop, especially at low contention rates. !Screen Shot 2020-07-05 at 13.26.17.png|width=500px! Repeating the test run shows reliably similar results: [^Screen Shot 2020-07-05 at 13.37.01.png] and [^Screen Shot 2020-07-05 at 13.35.55.png]. h4. Region Per Thread Experiments Implementing Region-per-thread (see the {{NativeAllocatorRegionTest.testRegionThreadLocal}} method), we can expect zero CAS failures over the life of a Region. For performance we see two orders of magnitude lower times to fill up the Region (~420ms). !Screen Shot 2020-07-05 at 13.48.16.png|width=200px! h4. Costs Region-per-thread is an unrealistic solution as it introduces many new problems, from increased memory use to memory leaks and GC issues. It is better tackled as part of a TPC implementation. The backoff approach is simple and elegant, and seems to improve throughput in all situations. It does introduce context switches, which may impact throughput in some busy scenarios, so this should be tested further. was: h4. Problem The method {{NativeAllocat
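Putting the pieces of the description above together, this is a simplified sketch of the allocation loop with the Constant Backoff CAS Algorithm applied. Names loosely follow {{NativeAllocator.Region}}; the off-heap peer pointer and the region-swap logic are elided, and the {{casFailures}} counter mirrors the field added in the attached test class:
{code:java}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

final class Region {
    private final AtomicInteger nextFreeOffset = new AtomicInteger(0);
    private final int capacity;
    final AtomicInteger casFailures = new AtomicInteger(0); // as in the attached test class

    Region(int capacity) { this.capacity = capacity; }

    /** @return the offset of the new allocation, or -1 if the region is full. */
    int allocate(int size) {
        while (true) {
            int oldOffset = nextFreeOffset.get();
            if (oldOffset + size > capacity)
                return -1; // region is full; the caller must swap in a new Region
            if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
                return oldOffset;
            casFailures.incrementAndGet();
            // we raced and lost alloc, try again (the one added line: constant backoff)
            LockSupport.parkNanos(1);
        }
    }
}
{code}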
[jira] [Comment Edited] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152252#comment-17152252 ] Michael Semb Wever edited comment on CASSANDRA-15234 at 7/6/20, 8:30 PM: - {quote}I am fine to be proved wrong in a justified way. Benedict Elliott Smith, Benjamin Lerer, Michael Semb Wever, do you agree with me on my suggestion (reorganizing the yaml file and doing the nested parameters approach later)? {quote} Let's keep listening to what everyone has to say. Some of us are better with the written word than others; it is a second language for some, and for me, as a native English speaker, it is still all too easy to miss things the first time they are said. On that, I believe everyone hears and recognises what [~e.dimitrova] is saying here regarding frustrations about such a substantial change being suggested so late in the game and the amount of time that's been asked to re-invest, especially when an almost identical user-experience improvement was presented two months ago. But it should be said again. On a side-note, it would have really helped me a lot if the comment [above|https://issues.apache.org/jira/browse/CASSANDRA-15234?focusedCommentId=17150521&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17150521] back-referenced [this discussion|https://github.com/apache/cassandra/pull/659#discussion_r449201020] where it originated. I know the ticket was referenced, but that discussion thread is the source of the suggestion. {quote}This ticket’s API replaces the current API, is mutually exclusive with the alternative proposal, and would be deprecated by it. If we introduce them both in 4.0-beta, we must maintain them both and go through the full deprecation process. So unfortunately no churn is avoided. {quote} AFAIK this is the only "grounded" justification for the veto. I don't agree that we are forced into that premise. We can get around those compatibility rules with a minimal amount of effort, by not deprecating the old API and not announcing (in docs or yaml) the new API. (I would expect everyone to intuitively treat private undocumented and un-referenced APIs, only ever available in alpha and beta releases, as unsupported.) All that "compatibility change" can be left and just done once in the separate ticket. The underlying framework and bulk of this patch can still be merged. Based on that I see three possible courses of action: 1. Accept investigating the alternative proposal, and include it in this ticket, delaying our first 4.0-beta release, 2. As (1) but requesting this ticket to be merged during 4.0-beta, so we can release 4.0-beta now, 3. Spin out the new suggestion and all public API changes to a separate ticket, slated for 4.0-beta, and merge this ticket. I suspect, since you have offered to help [~benedict], that most are in favour of (2)? was (Author: michaelsembwever): {quote}I am fine to be proved wrong in a justified way. Benedict Elliott Smith, Benjamin Lerer, Michael Semb Wever, do you agree with me on my suggestion (reorganizing the yaml file and doing the nested parameters approach later)? {quote} Let's keep listening to what everyone has to say. Some of us are better with the written word than others; it is a second language for some, and for me, as a native English speaker, it is still all too easy to miss things the first time they are said.
On that, I believe everyone hears and recognises what [~e.dimitrova] is saying here regarding frustrations about such a substantial change being suggested so late in the game and the amount of time that's been asked to re-invest, especially when an almost identical user-experience improvement was presented two months ago. But it should be said again. On a side-note, it would have really helped me a lot if the comment [above|https://issues.apache.org/jira/browse/CASSANDRA-15234?focusedCommentId=17150521&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17150521] back-referenced [this discussion|https://github.com/apache/cassandra/pull/659#discussion_r449201020] where it originated. I know the ticket was referenced, but that discussion thread is the source of the suggestion. {quote}This ticket’s API replaces the current API, is mutually exclusive with the alternative proposal, and would be deprecated by it. If we introduce them both in 4.0-beta, we must maintain them both and go through the full deprecation process. So unfortunately no churn is avoided. {quote} AFAIK this is the only "grounded" justification for the veto. I don't agree that we are forced into that premise. We can get around those compatibility rules with a minimal amount of effort, by not deprecating the old API and not announcing (in docs or yaml) the new API. (I would expect everyone to intuitive
[jira] [Commented] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152266#comment-17152266 ] David Capwell commented on CASSANDRA-15234: --- I have reread the conversations from the past 3 days several times; sorry if I missed anything or didn't grasp all the points. Most of the thread is about doing an all-or-nothing approach; thanks [~mck] for trying to argue for incremental improvement. Looking at the list of properties impacted (see https://github.com/apache/cassandra/compare/trunk...ekaterinadimitrova2:CASSANDRA-15234-new#diff-4302f2407249672d7845cd58027ff6e9R257-R339) it looks like a subset would be clearly impacted by the grouping approach, and others not so much or are complementary; given this we could accept a handful of the properties and move the others into the grouping work (renames such as read_request_timeout_in_ms to read_request_timeout feel fine even with the grouping approach, but renames that restructure names, such as enable_user_defined_functions to user_defined_functions_enabled, could be left out for now). I do agree with [~benedict] that it isn't ok to keep changing our config API since this is user facing; we should be strict about user-facing changes and try to help more than harm. If there is a belief that one structure is better than another then I value this dialog and hope we can get more eyes from the users/operators to see their thoughts; for this work we should really speak about the YAML representation rather than the code so we can agree on the final result. Also, given the framework that is provided by this patch, I don't see that work as throwing everything away; instead I see it benefiting from the work already started. Given the work involved is to add support for "moving" a field (the current "rename" is a special case of move where the move is at the same level) from one location to another (rename and conversion are already supported), this adds complexity for the case where the new and the old field are both used, and may hit issues with the SnakeYaml implementation. I do believe we should have this discussion and settle on a solution before releasing 4.0.0, but I do not feel that this discussion blocks a beta release. There is a lot of chatter about this being a beta blocker, but I don't really follow why this JIRA (or the grouping one) is a blocker. Reading https://cwiki.apache.org/confluence/display/CASSANDRA/Release+Lifecycle I don't see why this JIRA could not be done during beta; it meets every requirement for this phase. Given all the comments above, my TL;DR: * Can we find a subset of properties with the current patch which are not discarded by the grouping work (sample given above)? * Can we start the conversation and start asking operators of Cassandra clusters for their thoughts on grouping vs not grouping? Grouping could be nice for humans but could be horrid for some automation (I am neither pro nor against grouping; I defer to operators' preference here).
* Can we mark this ticket and the grouping one as non-blocking for beta? > Standardise config and JVM parameters > - > > Key: CASSANDRA-15234 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15234 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Benedict Elliott Smith >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-alpha > > Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt > > > We have a bunch of inconsistent names and config patterns in the codebase, > both from the yamls and JVM properties. It would be nice to standardise the > naming (such as otc_ vs internode_) as well as the provision of values with > units - while maintaining perpetual backwards compatibility with the old > parameter names, of course. > For temporal units, I would propose parsing strings with suffixes of: > {{code}} > u|micros(econds?)? > ms|millis(econds?)? > s(econds?)? > m(inutes?)? > h(ours?)? > d(ays?)? > mo(nths?)? > {{code}} > For rate units, I would propose parsing any of the standard {{B/s, KiB/s, > MiB/s, GiB/s, TiB/s}}. > Perhaps for avoiding ambiguity we could not accept bauds {{bs, Mbps}} or > powers of 1000 such as {{KB/s}}, given these are regularly used for either > their old or new definition e.g. {{KiB/s}}, or we could support them and > simply log the value in bytes/s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
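As a concrete illustration of the "moving/renaming a field" support discussed above, here is a hypothetical sketch (names invented; the actual patch differs) of an annotation-driven rename applied to the raw yaml map before SnakeYaml binds it onto {{Config}}:
{code:java}
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

public class RenameDemo {
    // Hypothetical annotation; the patch's actual mechanism differs in detail.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Replaces {
        String oldName();
    }

    public static class Config {
        @Replaces(oldName = "read_request_timeout_in_ms")
        public String read_request_timeout;
    }

    // Before binding, move any old key still present in the raw yaml map onto the new name.
    static void applyRenames(Map<String, Object> rawYaml, Class<?> configClass) {
        for (Field f : configClass.getFields()) {
            Replaces r = f.getAnnotation(Replaces.class);
            if (r != null && rawYaml.containsKey(r.oldName()) && !rawYaml.containsKey(f.getName()))
                rawYaml.put(f.getName(), rawYaml.remove(r.oldName())); // unit-conversion hook elided
        }
    }

    public static void main(String[] args) {
        Map<String, Object> raw = new HashMap<>();
        raw.put("read_request_timeout_in_ms", 5000);
        applyRenames(raw, Config.class);
        System.out.println(raw); // {read_request_timeout=5000}
    }
}
{code}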
[jira] [Comment Edited] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152252#comment-17152252 ] Michael Semb Wever edited comment on CASSANDRA-15234 at 7/6/20, 7:09 PM: - {quote}I am fine to be proved wrong in a justified way. Benedict Elliott Smith, Benjamin Lerer, Michael Semb Wever, do you agree with me on my suggestion (reorganizing the yaml file and doing the nested parameters approach later)? {quote} Let's keep listening to what everyone has to say. Some of us are better with the written word than others; it is a second language for some, and for me, as a native English speaker, it is still all too easy to miss things the first time they are said. On that, I believe everyone hears and recognises what [~e.dimitrova] is saying here regarding frustrations about such a substantial change being suggested so late in the game and the amount of time that's been asked to re-invest, especially when an almost identical user-experience improvement was presented two months ago. But it should be said again. On a side-note, it would have really helped me a lot if the comment [above|https://issues.apache.org/jira/browse/CASSANDRA-15234?focusedCommentId=17150521&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17150521] back-referenced [this discussion|https://github.com/apache/cassandra/pull/659#discussion_r449201020] where it originated. I know the ticket was referenced, but that discussion thread is the source of the suggestion. {quote}This ticket’s API replaces the current API, is mutually exclusive with the alternative proposal, and would be deprecated by it. If we introduce them both in 4.0-beta, we must maintain them both and go through the full deprecation process. So unfortunately no churn is avoided. {quote} AFAIK this is the only "grounded" justification for the veto. I don't agree that we are forced into that premise. We can get around those compatibility rules with a minimal amount of effort, by not deprecating the old API and not announcing (in docs or yaml) the new API. (I would expect everyone to intuitively treat private undocumented and un-referenced APIs, only ever available in alpha and beta releases, as unsupported.) All that "compatibility change" can be left and just done once in the separate ticket. The underlying framework and bulk of this patch can still be merged. Based on that I see three possible courses of action: 1. Accept investigating the alternative proposal, and include it in this ticket, delaying our first 4.0-beta release, 2. As (1) but requesting this ticket to be merged during 4.0-beta, so we can release 4.0-beta now, 3. Spin out the new suggestion and all public API changes to a separate ticket, slated for 4.0-beta, and merge this ticket. I suspect, since you have offered to help [~benedict], that most are in favour of (2)? was (Author: michaelsembwever): {quote}I am fine to be proved wrong in a justified way. Benedict Elliott Smith, Benjamin Lerer, Michael Semb Wever, do you agree with me on my suggestion (reorganizing the yaml file and doing the nested parameters approach later)? {quote} Let's keep listening to what everyone has to say. Some of us are better with the written word than others; it is a second language for some, and for me, as a native English speaker, it is still all too easy to miss things the first time they are said.
On that, I believe everyone hears and recognises what [~e.dimitrova] is saying here regarding frustrations about such a substantial change being suggested so late in the game and the amount of time that's been asked to re-invest, especially when an almost identical user-experience improvement was presented two months ago. But it should be said again. On a side-note, it would have really helped me a lot if the comment above back-referenced [this discussion|https://github.com/apache/cassandra/pull/659#discussion_r449201020] where it originated. I know the ticket was referenced, but that discussion thread is the source of the suggestion. {quote}This ticket’s API replaces the current API, is mutually exclusive with the alternative proposal, and would be deprecated by it. If we introduce them both in 4.0-beta, we must maintain them both and go through the full deprecation process. So unfortunately no churn is avoided. {quote} AFAIK this is the only "grounded" justification for the veto. I don't agree that we are forced into that premise. We can get around those compatibility rules with a minimal amount of effort, by not deprecating the old API and not announcing (in docs or yaml) the new API. (I would expect everyone to intuitively treat private undocumented and un-referenced APIs, only ever available in alpha and beta releases, as unsupported.) All that "compatibil
[jira] [Commented] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152252#comment-17152252 ] Michael Semb Wever commented on CASSANDRA-15234: {quote}I am fine to be proved wrong in a justified way. Benedict Elliott Smith, Benjamin Lerer, Michael Semb Wever, do you agree with me on my suggestion (reorganizing the yaml file and doing the nested parameters approach later)? {quote} Let's keep listening to what everyone has to say. Some of us are better with the written word than others; it is a second language for some, and for me, as a native English speaker, it is still all too easy to miss things the first time they are said. On that, I believe everyone hears and recognises what [~e.dimitrova] is saying here regarding frustrations about such a substantial change being suggested so late in the game and the amount of time that's been asked to re-invest, especially when an almost identical user-experience improvement was presented two months ago. But it should be said again. On a side-note, it would have really helped me a lot if the comment above back-referenced [this discussion|https://github.com/apache/cassandra/pull/659#discussion_r449201020] where it originated. I know the ticket was referenced, but that discussion thread is the source of the suggestion. {quote}This ticket’s API replaces the current API, is mutually exclusive with the alternative proposal, and would be deprecated by it. If we introduce them both in 4.0-beta, we must maintain them both and go through the full deprecation process. So unfortunately no churn is avoided. {quote} AFAIK this is the only "grounded" justification for the veto. I don't agree that we are forced into that premise. We can get around those compatibility rules with a minimal amount of effort, by not deprecating the old API and not announcing (in docs or yaml) the new API. (I would expect everyone to intuitively treat private undocumented and un-referenced APIs, only ever available in alpha and beta releases, as unsupported.) All that "compatibility change" can be left and just done once in the separate ticket. The underlying framework and bulk of this patch can still be merged. Based on that I see three possible courses of action: 1. Accept investigating the alternative proposal, and include it in this ticket, delaying our first 4.0-beta release, 2. As (1) but requesting this ticket to be merged during 4.0-beta, so we can release 4.0-beta now, 3. Spin out the new suggestion and all public API changes to a separate ticket, slated for 4.0-beta, and merge this ticket. I suspect, since you have offered to help [~benedict], that most are in favour of (2)? > Standardise config and JVM parameters > - > > Key: CASSANDRA-15234 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15234 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Benedict Elliott Smith >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-alpha > > Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt > > > We have a bunch of inconsistent names and config patterns in the codebase, > both from the yamls and JVM properties. It would be nice to standardise the > naming (such as otc_ vs internode_) as well as the provision of values with > units - while maintaining perpetual backwards compatibility with the old > parameter names, of course. > For temporal units, I would propose parsing strings with suffixes of: > {{code}} > u|micros(econds?)? > ms|millis(econds?)?
> s(econds?)? > m(inutes?)? > h(ours?)? > d(ays?)? > mo(nths?)? > {{code}} > For rate units, I would propose parsing any of the standard {{B/s, KiB/s, > MiB/s, GiB/s, TiB/s}}. > Perhaps for avoiding ambiguity we could not accept bauds {{bs, Mbps}} or > powers of 1000 such as {{KB/s}}, given these are regularly used for either > their old or new definition e.g. {{KiB/s}}, or we could support them and > simply log the value in bytes/s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15907) Operational Improvements & Hardening for Replica Filtering Protection
[ https://issues.apache.org/jira/browse/CASSANDRA-15907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Caleb Rackliffe updated CASSANDRA-15907: Reviewers: Andres de la Peña, Jordan West, Caleb Rackliffe (was: Andres de la Peña, Caleb Rackliffe, Jordan West) Andres de la Peña, Jordan West, Caleb Rackliffe (was: Andres de la Peña, Jordan West) Status: Review In Progress (was: Patch Available) > Operational Improvements & Hardening for Replica Filtering Protection > - > > Key: CASSANDRA-15907 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15907 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination, Feature/2i Index >Reporter: Caleb Rackliffe >Assignee: Caleb Rackliffe >Priority: Normal > Labels: 2i, memory > Fix For: 3.0.x, 3.11.x, 4.0-beta > > > CASSANDRA-8272 uses additional space on the heap to ensure correctness for 2i > and filtering queries at consistency levels above ONE/LOCAL_ONE. There are a > few things we should follow up on, however, to make life a bit easier for > operators and generally de-risk usage: > (Note: Line numbers are based on {{trunk}} as of > {{3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb}}.) > *Minor Optimizations* > * {{ReplicaFilteringProtection:114}} - Given we size them up-front, we may be > able to use simple arrays instead of lists for {{rowsToFetch}} and > {{originalPartitions}}. Alternatively (or also), we may be able to null out > references in these two collections more aggressively. (ex. Using > {{ArrayList#set()}} instead of {{get()}} in {{queryProtectedPartitions()}}, > assuming we pass {{toFetch}} as an argument to {{querySourceOnKey()}}.) > * {{ReplicaFilteringProtection:323}} - We may be able to use > {{EncodingStats.merge()}} and remove the custom {{stats()}} method. > * {{DataResolver:111 & 228}} - Cache an instance of > {{UnaryOperator#identity()}} instead of creating one on the fly. > * {{ReplicaFilteringProtection:217}} - We may be able to scatter/gather > rather than serially querying every row that needs to be completed. This > isn't a clear win perhaps, given it targets the latency of single queries and > adds some complexity. (Certainly a decent candidate to kick even out of this > issue.) > *Documentation and Intelligibility* > * There are a few places (CHANGES.txt, tracing output in > {{ReplicaFilteringProtection}}, etc.) where we mention "replica-side > filtering protection" (which makes it seem like the coordinator doesn't > filter) rather than "replica filtering protection" (which sounds more like > what we actually do, which is protect ourselves against incorrect replica > filtering results). It's a minor fix, but would avoid confusion. > * The method call chain in {{DataResolver}} might be a bit simpler if we put > the {{repairedDataTracker}} in {{ResolveContext}}. > *Testing* > * I want to bite the bullet and get some basic tests for RFP (including any > guardrails we might add here) onto the in-JVM dtest framework. > *Guardrails* > * As it stands, we don't have a way to enforce an upper bound on the memory > usage of {{ReplicaFilteringProtection}} which caches row responses from the > first round of requests. (Remember, these are later merged with the > second round of results to complete the data for filtering.) Operators will > likely need a way to protect themselves, i.e. simply fail queries if they hit > a particular threshold rather than GC nodes into oblivion.
(Having control > over limits and page sizes doesn't quite get us there, because stale results > _expand_ the number of incomplete results we must cache.) The fun question is > how we do this, with the primary axes being scope (per-query, global, etc.) > and granularity (per-partition, per-row, per-cell, actual heap usage, etc.). > My starting disposition on the right trade-off between > performance/complexity and accuracy is having something along the lines of > cached rows per query. Prior art suggests this probably makes sense alongside > things like {{tombstone_failure_threshold}} in {{cassandra.yaml}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
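One hypothetical shape for the per-query cached-rows guardrail described above (names and thresholds invented; in practice they would come from {{cassandra.yaml}}), warning at one threshold and failing the query at a higher one:
{code:java}
// Hypothetical sketch, not the committed implementation.
final class CachedRowsGuard {
    private final int warnThreshold;
    private final int failThreshold;
    private int cachedRows;
    private boolean warned;

    CachedRowsGuard(int warnThreshold, int failThreshold) {
        this.warnThreshold = warnThreshold;
        this.failThreshold = failThreshold;
    }

    /** Called each time replica filtering protection caches another row response. */
    void onRowCached() {
        cachedRows++;
        if (cachedRows > failThreshold)
            throw new IllegalStateException("replica filtering protection cached " + cachedRows + " rows; failing query");
        if (cachedRows > warnThreshold && !warned) {
            warned = true; // warn only once per query
            System.err.println("WARN: replica filtering protection has cached " + cachedRows + " rows");
        }
    }
}
{code}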
[jira] [Comment Edited] (CASSANDRA-15907) Operational Improvements & Hardening for Replica Filtering Protection
[ https://issues.apache.org/jira/browse/CASSANDRA-15907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152248#comment-17152248 ] Caleb Rackliffe edited comment on CASSANDRA-15907 at 7/6/20, 6:58 PM: -- [~jwest] I've hopefully addressed the points from [~adelapena]'s first round of review, so I think this is officially ready for a second reviewer. 3.0: [patch|https://github.com/apache/cassandra/pull/659], [CircleCI|https://app.circleci.com/pipelines/github/maedhroz/cassandra/22/workflows/d272c9e6-1db6-472f-93d9-f2715a25ef97] If we're happy with the implementation, the next step will be to do some basic stress testing. was (Author: maedhroz): [~jwest] I've hopefully addressed the points from [~adelapena]'s first round of review, so I think this is officially ready for a second reviewer. 3.0: [patch|https://github.com/apache/cassandra/pull/659], [CircleCI|https://app.circleci.com/pipelines/github/maedhroz/cassandra/22/workflows/d272c9e6-1db6-472f-93d9-f2715a25ef97] > Operational Improvements & Hardening for Replica Filtering Protection > - > > Key: CASSANDRA-15907 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15907 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination, Feature/2i Index >Reporter: Caleb Rackliffe >Assignee: Caleb Rackliffe >Priority: Normal > Labels: 2i, memory > Fix For: 3.0.x, 3.11.x, 4.0-beta > > > CASSANDRA-8272 uses additional space on the heap to ensure correctness for 2i > and filtering queries at consistency levels above ONE/LOCAL_ONE. There are a > few things we should follow up on, however, to make life a bit easier for > operators and generally de-risk usage: > (Note: Line numbers are based on {{trunk}} as of > {{3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb}}.) > *Minor Optimizations* > * {{ReplicaFilteringProtection:114}} - Given we size them up-front, we may be > able to use simple arrays instead of lists for {{rowsToFetch}} and > {{originalPartitions}}. Alternatively (or also), we may be able to null out > references in these two collections more aggressively. (ex. Using > {{ArrayList#set()}} instead of {{get()}} in {{queryProtectedPartitions()}}, > assuming we pass {{toFetch}} as an argument to {{querySourceOnKey()}}.) > * {{ReplicaFilteringProtection:323}} - We may be able to use > {{EncodingStats.merge()}} and remove the custom {{stats()}} method. > * {{DataResolver:111 & 228}} - Cache an instance of > {{UnaryOperator#identity()}} instead of creating one on the fly. > * {{ReplicaFilteringProtection:217}} - We may be able to scatter/gather > rather than serially querying every row that needs to be completed. This > isn't a clear win perhaps, given it targets the latency of single queries and > adds some complexity. (Certainly a decent candidate to kick even out of this > issue.) > *Documentation and Intelligibility* > * There are a few places (CHANGES.txt, tracing output in > {{ReplicaFilteringProtection}}, etc.) where we mention "replica-side > filtering protection" (which makes it seem like the coordinator doesn't > filter) rather than "replica filtering protection" (which sounds more like > what we actually do, which is protect ourselves against incorrect replica > filtering results). It's a minor fix, but would avoid confusion. > * The method call chain in {{DataResolver}} might be a bit simpler if we put > the {{repairedDataTracker}} in {{ResolveContext}}. 
> *Testing* > * I want to bite the bullet and get some basic tests for RFP (including any > guardrails we might add here) onto the in-JVM dtest framework. > *Guardrails* > * As it stands, we don't have a way to enforce an upper bound on the memory > usage of {{ReplicaFilteringProtection}} which caches row responses from the > first round of requests. (Remember, these are later merged with the > second round of results to complete the data for filtering.) Operators will > likely need a way to protect themselves, i.e. simply fail queries if they hit > a particular threshold rather than GC nodes into oblivion. (Having control > over limits and page sizes doesn't quite get us there, because stale results > _expand_ the number of incomplete results we must cache.) The fun question is > how we do this, with the primary axes being scope (per-query, global, etc.) > and granularity (per-partition, per-row, per-cell, actual heap usage, etc.). > My starting disposition on the right trade-off between > performance/complexity and accuracy is having something along the lines of > cached rows per query. Prior art suggests this probably makes sense alongside > things like {{tombstone_failure_threshold}} in {{cassandra.yaml}}. -- This me
[jira] [Updated] (CASSANDRA-15907) Operational Improvements & Hardening for Replica Filtering Protection
[ https://issues.apache.org/jira/browse/CASSANDRA-15907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Caleb Rackliffe updated CASSANDRA-15907: Test and Documentation Plan: The first line of defense against regression here is the set of dtests built for CASSANDRA-8272 in {{replica_side_filtering}}. In addition to that, we'll need at minimum a basic battery of in-JVM dtests around the new guardrails. Once the implementation is reviewed, we'll use the {{tlp-stress}} filtering workload to stress things a bit, both to see how things behave with larger sets of query results when filtering protection isn't activated, and to see how the thresholds work when we have severely out-of-sync replicas. was: The first line of defense against regression here is the set of dtests built for CASSANDRA-8272 in {{replica_side_filtering}}. In addition to that, we'll need at minimum a basic battery of in-JVM dtests around the new guardrails. Once the implementation is reviewed, we'll use the {{tlp-stress}} filtering workload to stress things a bit, both to see how things behave with larger sets of query results when filtering protection isn't activated, and to see how the thresholds work when we have severely out-of-sync replicas. > Operational Improvements & Hardening for Replica Filtering Protection > - > > Key: CASSANDRA-15907 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15907 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination, Feature/2i Index >Reporter: Caleb Rackliffe >Assignee: Caleb Rackliffe >Priority: Normal > Labels: 2i, memory > Fix For: 3.0.x, 3.11.x, 4.0-beta > > > CASSANDRA-8272 uses additional space on the heap to ensure correctness for 2i > and filtering queries at consistency levels above ONE/LOCAL_ONE. There are a > few things we should follow up on, however, to make life a bit easier for > operators and generally de-risk usage: > (Note: Line numbers are based on {{trunk}} as of > {{3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb}}.) > *Minor Optimizations* > * {{ReplicaFilteringProtection:114}} - Given we size them up-front, we may be > able to use simple arrays instead of lists for {{rowsToFetch}} and > {{originalPartitions}}. Alternatively (or also), we may be able to null out > references in these two collections more aggressively. (ex. Using > {{ArrayList#set()}} instead of {{get()}} in {{queryProtectedPartitions()}}, > assuming we pass {{toFetch}} as an argument to {{querySourceOnKey()}}.) > * {{ReplicaFilteringProtection:323}} - We may be able to use > {{EncodingStats.merge()}} and remove the custom {{stats()}} method. > * {{DataResolver:111 & 228}} - Cache an instance of > {{UnaryOperator#identity()}} instead of creating one on the fly. > * {{ReplicaFilteringProtection:217}} - We may be able to scatter/gather > rather than serially querying every row that needs to be completed. This > isn't a clear win perhaps, given it targets the latency of single queries and > adds some complexity. (Certainly a decent candidate to kick even out of this > issue.) > *Documentation and Intelligibility* > * There are a few places (CHANGES.txt, tracing output in > {{ReplicaFilteringProtection}}, etc.) where we mention "replica-side > filtering protection" (which makes it seem like the coordinator doesn't > filter) rather than "replica filtering protection" (which sounds more like > what we actually do, which is protect ourselves against incorrect replica > filtering results). It's a minor fix, but would avoid confusion.
> * The method call chain in {{DataResolver}} might be a bit simpler if we put > the {{repairedDataTracker}} in {{ResolveContext}}. > *Testing* > * I want to bite the bullet and get some basic tests for RFP (including any > guardrails we might add here) onto the in-JVM dtest framework. > *Guardrails* > * As it stands, we don't have a way to enforce an upper bound on the memory > usage of {{ReplicaFilteringProtection}} which caches row responses from the > first round of requests. (Remember, these are later merged with > the second round of results to complete the data for filtering.) Operators will > likely need a way to protect themselves, i.e. simply fail queries if they hit > a particular threshold rather than GC nodes into oblivion. (Having control > over limits and page sizes doesn't quite get us there, because stale results > _expand_ the number of incomplete results we must cache.) The fun question is > how we do this, with the primary axes being scope (per-query, global, etc.) > and granularity (per-partition, per-row, per-cell, actual heap usage, etc.). > My starting disposition on the right trade-off between > performance/complexit
[jira] [Commented] (CASSANDRA-15907) Operational Improvements & Hardening for Replica Filtering Protection
[ https://issues.apache.org/jira/browse/CASSANDRA-15907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152248#comment-17152248 ] Caleb Rackliffe commented on CASSANDRA-15907: - [~jwest] I've hopefully addressed the points from [~adelapena]'s first round of review, so I think this is officially ready for a second reviewer. 3.0: [patch|https://github.com/apache/cassandra/pull/659], [CircleCI|https://app.circleci.com/pipelines/github/maedhroz/cassandra/22/workflows/d272c9e6-1db6-472f-93d9-f2715a25ef97] > Operational Improvements & Hardening for Replica Filtering Protection > - > > Key: CASSANDRA-15907 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15907 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination, Feature/2i Index >Reporter: Caleb Rackliffe >Assignee: Caleb Rackliffe >Priority: Normal > Labels: 2i, memory > Fix For: 3.0.x, 3.11.x, 4.0-beta > > > CASSANDRA-8272 uses additional space on the heap to ensure correctness for 2i > and filtering queries at consistency levels above ONE/LOCAL_ONE. There are a > few things we should follow up on, however, to make life a bit easier for > operators and generally de-risk usage: > (Note: Line numbers are based on {{trunk}} as of > {{3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb}}.) > *Minor Optimizations* > * {{ReplicaFilteringProtection:114}} - Given we size them up-front, we may be > able to use simple arrays instead of lists for {{rowsToFetch}} and > {{originalPartitions}}. Alternatively (or also), we may be able to null out > references in these two collections more aggressively. (ex. Using > {{ArrayList#set()}} instead of {{get()}} in {{queryProtectedPartitions()}}, > assuming we pass {{toFetch}} as an argument to {{querySourceOnKey()}}.) > * {{ReplicaFilteringProtection:323}} - We may be able to use > {{EncodingStats.merge()}} and remove the custom {{stats()}} method. > * {{DataResolver:111 & 228}} - Cache an instance of > {{UnaryOperator#identity()}} instead of creating one on the fly. > * {{ReplicaFilteringProtection:217}} - We may be able to scatter/gather > rather than serially querying every row that needs to be completed. This > isn't a clear win perhaps, given it targets the latency of single queries and > adds some complexity. (Certainly a decent candidate to kick even out of this > issue.) > *Documentation and Intelligibility* > * There are a few places (CHANGES.txt, tracing output in > {{ReplicaFilteringProtection}}, etc.) where we mention "replica-side > filtering protection" (which makes it seem like the coordinator doesn't > filter) rather than "replica filtering protection" (which sounds more like > what we actually do, which is protect ourselves against incorrect replica > filtering results). It's a minor fix, but would avoid confusion. > * The method call chain in {{DataResolver}} might be a bit simpler if we put > the {{repairedDataTracker}} in {{ResolveContext}}. > *Testing* > * I want to bite the bullet and get some basic tests for RFP (including any > guardrails we might add here) onto the in-JVM dtest framework. > *Guardrails* > * As it stands, we don't have a way to enforce an upper bound on the memory > usage of {{ReplicaFilteringProtection}} which caches row responses from the > first round of requests. (Remember, these are later merged with the > second round of results to complete the data for filtering.) Operators will > likely need a way to protect themselves, i.e.
simply fail queries if they hit > a particular threshold rather than GC nodes into oblivion. (Having control > over limits and page sizes doesn't quite get us there, because stale results > _expand_ the number of incomplete results we must cache.) The fun question is > how we do this, with the primary axes being scope (per-query, global, etc.) > and granularity (per-partition, per-row, per-cell, actual heap usage, etc.). > My starting disposition on the right trade-off between > performance/complexity and accuracy is having something along the lines of > cached rows per query. Prior art suggests this probably makes sense alongside > things like {{tombstone_failure_threshold}} in {{cassandra.yaml}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15907) Operational Improvements & Hardening for Replica Filtering Protection
[ https://issues.apache.org/jira/browse/CASSANDRA-15907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Caleb Rackliffe updated CASSANDRA-15907: Test and Documentation Plan: The first line of defense against regression here is the set of dtests built for CASSANDRA-8272 in {{replica_side_filtering}}. In addition to that, we'll need at minimum a basic battery of in-JVM dtests around the new guardrails. Once the implementation is reviewed, we'll use the {{tlp-stress}} filtering workload to stress things a bit, both to see how things behave with larger sets of query results when filtering protection isn't activated, and to see how the thresholds work when we have severely out-of-sync replicas. was: The first line of defense against regression here is the set of dtests built for CASSANDRA-8272 in {{replica_side_filtering}}. In addition to that, we'll need at minimum a basic battery of in-JVM dtests around the new guardrails. Status: Patch Available (was: In Progress) > Operational Improvements & Hardening for Replica Filtering Protection > - > > Key: CASSANDRA-15907 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15907 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Coordination, Feature/2i Index >Reporter: Caleb Rackliffe >Assignee: Caleb Rackliffe >Priority: Normal > Labels: 2i, memory > Fix For: 3.0.x, 3.11.x, 4.0-beta > > > CASSANDRA-8272 uses additional space on the heap to ensure correctness for 2i > and filtering queries at consistency levels above ONE/LOCAL_ONE. There are a > few things we should follow up on, however, to make life a bit easier for > operators and generally de-risk usage: > (Note: Line numbers are based on {{trunk}} as of > {{3cfe3c9f0dcf8ca8b25ad111800a21725bf152cb}}.) > *Minor Optimizations* > * {{ReplicaFilteringProtection:114}} - Given we size them up-front, we may be > able to use simple arrays instead of lists for {{rowsToFetch}} and > {{originalPartitions}}. Alternatively (or also), we may be able to null out > references in these two collections more aggressively. (ex. Using > {{ArrayList#set()}} instead of {{get()}} in {{queryProtectedPartitions()}}, > assuming we pass {{toFetch}} as an argument to {{querySourceOnKey()}}.) > * {{ReplicaFilteringProtection:323}} - We may be able to use > {{EncodingStats.merge()}} and remove the custom {{stats()}} method. > * {{DataResolver:111 & 228}} - Cache an instance of > {{UnaryOperator#identity()}} instead of creating one on the fly. > * {{ReplicaFilteringProtection:217}} - We may be able to scatter/gather > rather than serially querying every row that needs to be completed. This > isn't a clear win perhaps, given it targets the latency of single queries and > adds some complexity. (Certainly a decent candidate to kick even out of this > issue.) > *Documentation and Intelligibility* > * There are a few places (CHANGES.txt, tracing output in > {{ReplicaFilteringProtection}}, etc.) where we mention "replica-side > filtering protection" (which makes it seem like the coordinator doesn't > filter) rather than "replica filtering protection" (which sounds more like > what we actually do, which is protect ourselves against incorrect replica > filtering results). It's a minor fix, but would avoid confusion. > * The method call chain in {{DataResolver}} might be a bit simpler if we put > the {{repairedDataTracker}} in {{ResolveContext}}.
> *Testing* > * I want to bite the bullet and get some basic tests for RFP (including any > guardrails we might add here) onto the in-JVM dtest framework. > *Guardrails* > * As it stands, we don't have a way to enforce an upper bound on the memory > usage of {{ReplicaFilteringProtection}} which caches row responses from the > first round of requests. (Remember, these are later merged with the > second round of results to complete the data for filtering.) Operators will > likely need a way to protect themselves, i.e. simply fail queries if they hit > a particular threshold rather than GC nodes into oblivion. (Having control > over limits and page sizes doesn't quite get us there, because stale results > _expand_ the number of incomplete results we must cache.) The fun question is > how we do this, with the primary axes being scope (per-query, global, etc.) > and granularity (per-partition, per-row, per-cell, actual heap usage, etc.). > My starting disposition on the right trade-off between > performance/complexity and accuracy is having something along the lines of > cached rows per query. Prior art suggests this probably makes sense alongside > things like {{tombstone_failure_threshold}} in {{cassandra.yaml}}. -- This message was sen
[jira] [Comment Edited] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152224#comment-17152224 ] Caleb Rackliffe edited comment on CASSANDRA-15234 at 7/6/20, 6:00 PM: -- My mental model of why grouping _might_ be valuable: * It provides a logical place to describe/comment on entire features in the YAML. * It avoids duplicate/unwieldy prefixing without sacrificing intelligibility/specificity. * It doesn't rely on the presence of comments. My understanding of the changes here is that there are dozens of options that have already been renamed. Assuming we proceed with grouping, supporting three different forms of these options doesn't seem like the outcome we want. There are really only a handful of groupings that would be interesting and obvious. Essentially, hinted handoff, commitlog, memtable, rpc, compaction, and maybe the caches. (Timeouts seem a bit scattered.) What I'm most worried about is the number of versions we have to support at any given time, not whether we change some option grouping early in the beta period. My vote, at this point, would be to just move this issue to beta and hash out a proposal for the (somewhat obvious) option groups I've mentioned above. was (Author: maedhroz): My mental model of why grouping _might_ be valuable: * It provides a logical place to describe/comment on entire features in the YAML. * It avoids duplicate prefixing without sacrificing intelligibility/specificity. * It doesn't rely on the presence of comments. My understanding of the changes here is that there are dozens of options that have already been renamed. Assuming we proceed with grouping, supporting three different forms of these options doesn't seem like the outcome we want. There are really only a handful of groupings that would be interesting and obvious. Essentially, hinted handoff, commitlog, memtable, rpc, compaction, and maybe the caches. (Timeouts seem a bit scattered.) What I'm most worried about is the number of versions we have to support at any given time, not whether we change some option grouping early in the beta period. My vote, at this point, would be to just move this issue to beta and hash out a proposal for the (somewhat obvious) option groups I've mentioned above. > Standardise config and JVM parameters > - > > Key: CASSANDRA-15234 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15234 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Benedict Elliott Smith >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-alpha > > Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt > > > We have a bunch of inconsistent names and config patterns in the codebase, > both from the yamls and JVM properties. It would be nice to standardise the > naming (such as otc_ vs internode_) as well as the provision of values with > units - while maintaining perpetual backwards compatibility with the old > parameter names, of course. > For temporal units, I would propose parsing strings with suffixes of: > {{code}} > u|micros(econds?)? > ms|millis(econds?)? > s(econds?)? > m(inutes?)? > h(ours?)? > d(ays?)? > mo(nths?)? > {{code}} > For rate units, I would propose parsing any of the standard {{B/s, KiB/s, > MiB/s, GiB/s, TiB/s}}. > Perhaps for avoiding ambiguity we could not accept bauds {{bs, Mbps}} or > powers of 1000 such as {{KB/s}}, given these are regularly used for either > their old or new definition e.g.
{{KiB/s}}, or we could support them and > simply log the value in bytes/s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152224#comment-17152224 ] Caleb Rackliffe commented on CASSANDRA-15234: - My mental model of why grouping _might_ be valuable: * It provides a logical place to describe/comment on entire features in the YAML. * It avoids duplicate prefixing without sacrificing intelligibility/specificity. * It doesn't rely on the presence of comments. My understanding of the changes here is that there are dozens of options that have already been renamed. Assuming we proceed with grouping, supporting three different forms of these options doesn't seem like the outcome we want. There are really only a handful of groupings that would be interesting and obvious. Essentially, hinted handoff, commitlog, memtable, rpc, compaction, and maybe the caches. (Timeouts seem a bit scattered.) What I'm most worried about is the number of versions we have to support at any given time, not whether we change some option grouping early in the beta period. My vote, at this point, would be to just move this issue to beta and hash out a proposal for the (somewhat obvious) option groups I've mentioned above. > Standardise config and JVM parameters > - > > Key: CASSANDRA-15234 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15234 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Benedict Elliott Smith >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-alpha > > Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt > > > We have a bunch of inconsistent names and config patterns in the codebase, > both from the yamls and JVM properties. It would be nice to standardise the > naming (such as otc_ vs internode_) as well as the provision of values with > units - while maintaining perpetual backwards compatibility with the old > parameter names, of course. > For temporal units, I would propose parsing strings with suffixes of: > {{code}} > u|micros(econds?)? > ms|millis(econds?)? > s(econds?)? > m(inutes?)? > h(ours?)? > d(ays?)? > mo(nths?)? > {{code}} > For rate units, I would propose parsing any of the standard {{B/s, KiB/s, > MiB/s, GiB/s, TiB/s}}. > Perhaps for avoiding ambiguity we could not accept bauds {{bs, Mbps}} or > powers of 1000 such as {{KB/s}}, given these are regularly used for either > their old or new definition e.g. {{KiB/s}}, or we could support them and > simply log the value in bytes/s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
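For illustration, a minimal sketch of the kind of suffix parsing proposed in the description, covering only a simplified subset of the suffixes (the full grammar and the ambiguity rules above are left out):
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DurationParser {
    private static final Pattern SUFFIXED = Pattern.compile("(\\d+)\\s*(us|ms|s|m|h|d)");

    // Parses e.g. "500ms" or "2m" and returns the value in nanoseconds.
    static long toNanos(String input) {
        Matcher m = SUFFIXED.matcher(input.trim());
        if (!m.matches())
            throw new IllegalArgumentException("unparseable duration: " + input);
        long value = Long.parseLong(m.group(1));
        switch (m.group(2)) {
            case "us": return TimeUnit.MICROSECONDS.toNanos(value);
            case "ms": return TimeUnit.MILLISECONDS.toNanos(value);
            case "s":  return TimeUnit.SECONDS.toNanos(value);
            case "m":  return TimeUnit.MINUTES.toNanos(value);
            case "h":  return TimeUnit.HOURS.toNanos(value);
            default:   return TimeUnit.DAYS.toNanos(value);
        }
    }

    public static void main(String[] args) {
        System.out.println(toNanos("500ms")); // 500000000
        System.out.println(toNanos("2m"));    // 120000000000
    }
}
{code}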
[jira] [Commented] (CASSANDRA-15909) Make Table/Keyspace Metric Names Consistent With Each Other
[ https://issues.apache.org/jira/browse/CASSANDRA-15909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1715#comment-1715 ] David Capwell commented on CASSANDRA-15909: --- bq. The good news is that these two are really new and only present in trunk (and only added in the last few weeks), so we don't need to bother with deprecation. Sounds good to me > Make Table/Keyspace Metric Names Consistent With Each Other > --- > > Key: CASSANDRA-15909 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15909 > Project: Cassandra > Issue Type: Improvement > Components: Observability/Metrics >Reporter: Stephen Mallette >Assignee: Stephen Mallette >Priority: Normal > Fix For: 4.0-beta > > > As part of CASSANDRA-15821 it became apparent that certain metric names found > in keyspace and tables had different names but were in fact the same metric - > they are as follows: > * Table.SyncTime == Keyspace.RepairSyncTime > * Table.RepairedDataTrackingOverreadRows == Keyspace.RepairedOverreadRows > * Table.RepairedDataTrackingOverreadTime == Keyspace.RepairedOverreadTime > * Table.AllMemtablesHeapSize == Keyspace.AllMemtablesOnHeapDataSize > * Table.AllMemtablesOffHeapSize == Keyspace.AllMemtablesOffHeapDataSize > * Table.MemtableOnHeapSize == Keyspace.MemtableOnHeapDataSize > * Table.MemtableOffHeapSize == Keyspace.MemtableOffHeapDataSize > Also, client metrics are the only metrics to start with a lower case letter. > Change those to upper case to match all the other metrics. > Unifying this naming would help make metrics more consistent as part of > CASSANDRA-15582 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
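Where a renamed metric does need a deprecation window, one common pattern with Dropwizard Metrics (illustrative only; not what the patch above does, since these names are trunk-only) is to register the same underlying instance under both names, so neither view of the data is lost:
{code:java}
import java.util.concurrent.TimeUnit;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class MetricAlias {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        Timer syncTime = registry.timer("Table.SyncTime");
        // Register the very same Timer instance under the keyspace-scoped alias.
        registry.register("Keyspace.RepairSyncTime", syncTime);
        syncTime.update(5, TimeUnit.MILLISECONDS);
        // Both names observe the same underlying data.
        System.out.println(registry.timer("Keyspace.RepairSyncTime").getCount()); // 1
    }
}
{code}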
[jira] [Comment Edited] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152159#comment-17152159 ] Ekaterina Dimitrova edited comment on CASSANDRA-15234 at 7/6/20, 5:08 PM: -- Apologies for my late response. I was a bit sick these days and tried to disengage from work and take some rest over the weekend. With all my respect to everyone's opinion and experience on the project, I have two points here: - I truly support [~mck]'s questions. I believe they should be responded to before any decision is taken and someone jumps into actual work. {quote}how many settings does it apply to? is taxonomy based on a technical or user perspective? if user/operator based, how many people need to be involved to get it right? if user/operator based, what if one property applies to multiple concerns? how does the @Replace annotation work between levels in the grouping? does this introduce more complexity/variations in what has to be tested? (since yaml can consist of old and new setting names) {quote} - I was also wondering today while I was trying to be open-minded and look from all perspectives at this ticket/patch... Did anyone check the first [commit|https://github.com/ekaterinadimitrova2/cassandra/blob/CASSANDRA-15234-1-outdated/conf/cassandra.yaml] where I suggested reorganizing the text in the yaml into sections? I also put it into the ticket thread. This was a quick draft shared two months ago that could be reworked to sections that satisfy the users' requirements for clarity and consistency. Do we see any big difference for the users between: {code:java} #*Replica Filtering Protection* cached_rows_warn_threshold: 1000 cached_rows_fail_threshold: 16000 {code} and: {code:java} replica_filtering_protection: - cached_rows_warn_threshold: 1000 - cached_rows_fail_threshold: 16000 {code} From that perspective, I think the C* community can accept this patch and then we can raise a new ticket to improve the internals from our engineering perspective in Beta (refactoring the Config class and the backward compatibility framework), as suggested by [~mck]. I think this work could really be considered incremental work. Having that in mind, honestly, I don't find a justification to spend my time to rework and fully re-test the patch at this point in time. I am fine to be proved wrong in a justified way. [~benedict], [~blerer], [~mck], do you agree with me on my suggestion (reorganizing the yaml file and doing the nested parameters approach later)? {quote} I think this is indeed preferable to releasing an API we already expect to deprecate, however I think we're overstating the difficulty here. We haven't debated the parameter naming much at all, and we can easily land this in 4.0-beta. If [~e.dimitrova] doesn't have the time, and 4.0-beta is an acceptable window to land the work, I can take a look in a few weeks. {quote} I want to be clear - it is not about difficulty; this patch is time-consuming. It needs attention to detail and a look at the whole config, which touches the code in many places (also ccm, dtests, in-jvm tests, etc.) was (Author: e.dimitrova): Apologies for my late response. I was a bit sick these days and tried to disengage from work and take some rest over the weekend. With all my respect to everyone's opinion and experience on the project, I have two points here: - I truly support [~mck]'s questions. I believe they should be responded to before any decision is taken and someone jumps into actual work.
{quote}how many settings does it apply to? is taxonomy based on a technical or user perspective? if user/operator based, how many people need to be involved to get it right? if user/operator based, what if one property applies to multiple concerns? how does the @Replace annotation work between levels in the grouping? does this introduce more complexity/variations in what has to be tested? (since yaml can consist of old and new setting names) {quote} - I was also wondering today while I was trying to be open-minded and look from all perspectives at this ticket/patch... Did anyone check the first [commit|https://github.com/ekaterinadimitrova2/cassandra/blob/CASSANDRA-15234-1-outdated/conf/cassandra.yaml] where I suggested reorganizing the text in the yaml into sections? I also put it into the ticket thread. This was a quick draft shared two months ago that could be reworked to sections that satisfy the users' requirements for clarity and consistency. Do we see any big difference for the users between: {code:java} #*Replica Filtering Protection* cached_rows_warn_threshold: 1000 cached_rows_fail_threshold: 16000 {code} and: {code:java} replica_filtering_protection: - cached_rows_warn_threshold: 1000 - cached_rows_fail_threshold: 16000 {code} From that p
[jira] [Comment Edited] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152159#comment-17152159 ] Ekaterina Dimitrova edited comment on CASSANDRA-15234 at 7/6/20, 5:07 PM: -- Apologies for my late response. I was a bit sick these days and tried to disengage from work and take some rest over the weekend. With all my respect to everyone's opinion and experience on the project, I have two points here: - I truly support [~mck]'s questions. I believe they should be responded to before any decision is taken and someone jumps into actual work. {quote}how many settings does it apply to? is taxonomy based on a technical or user perspective? if user/operator based, how many people need to be involved to get it right? if user/operator based, what if one property applies to multiple concerns? how does the @Replace annotation work between levels in the grouping? does this introduce more complexity/variations in what has to be tested? (since yaml can consist of old and new setting names) {quote} - I was also wondering today while I was trying to be open-minded and look from all perspectives at this ticket/patch... Did anyone check the first [commit|https://github.com/ekaterinadimitrova2/cassandra/blob/CASSANDRA-15234-1-outdated/conf/cassandra.yaml] where I suggested reorganizing the text in the yaml into sections? I also put it into the ticket thread. This was a quick draft shared two months ago that could be reworked to sections that satisfy the users' requirements for clarity and consistency. Do we see any big difference for the users between: {code:java} #*Replica Filtering Protection* cached_rows_warn_threshold: 1000 cached_rows_fail_threshold: 16000 {code} and: {code:java} replica_filtering_protection: - cached_rows_warn_threshold: 1000 - cached_rows_fail_threshold: 16000 {code} From that perspective, I think the C* community can accept this patch and then we can raise a new ticket to improve the internals from our engineering perspective in Beta (refactoring the Config class and the backward compatibility framework), as suggested by [~mck]. I think this work could really be considered incremental work. Having that in mind, honestly, I don't find a justification to spend my time to rework and fully re-test the patch at this point in time. I am fine to be proved wrong in a justified way. [~benedict], [~blerer], [~mck], do you agree with me on my suggestion (reorganizing the yaml file and doing the nested parameters approach later)? {quote} I think this is indeed preferable to releasing an API we already expect to deprecate, however I think we're overstating the difficulty here. We haven't debated the parameter naming much at all, and we can easily land this in 4.0-beta. If [~e.dimitrova] doesn't have the time, and 4.0-beta is an acceptable window to land the work, I can take a look in a few weeks. {quote} I want to be clear - it is not about difficulty; this patch is time-consuming. It needs attention to detail and a look at the whole config, which touches the code in many places (also ccm, dtests, in-jvm tests, etc.) was (Author: e.dimitrova): Apologies for my late response. I was a bit sick these days and tried to disengage from work and take some rest over the weekend. With all my respect to everyone's opinion and experience on the project, I have two points here: - I truly support [~mck]'s questions. I believe they should be responded to before any decision is taken and someone jumps into actual work.
{quote}how many settings does it apply to? is taxonomy based on a technical or user perspective? if user/operator based, how many people need to be involved to get it right? if user/operator based, what if one property applies to multiple concerns? how does the @Replace annotation work between levels in the grouping? does this introduce more complexity/variations in what has to be tested? (since yaml can consist of old and new setting names) {quote} - I was also wondering today while I was trying to be open-minded and look from all perspectives at this ticket/patch... Did anyone check the first [commit |https://github.com/ekaterinadimitrova2/cassandra/blob/CASSANDRA-15234-1-outdated/conf/cassandra.yaml] where I was suggesting reorganizing of the text into the yaml into sections? I also put it into the ticket thread . This was a quick draft shared two months ago that could be reworked to sections that satisfy the users' requirements for clarity and consistency. Do we see any big difference for the users between: {code:java} #*Replica Filtering Protection* cached_rows_warn_threshold: 1000 cached_rows_fail_threshold: 16000 {code} and: {code:java} replica_filtering_protection: - cached_rows_warn_threshold: 1000 - cached_rows_fail_threshold: 16000 {code} >From that p
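A note for readers following the yaml discussion: written as a proper yaml mapping, the nested form would carry no leading dashes (dashes denote list items), and the grouped section would then bind onto a nested structure in the Config class. A minimal sketch of what such a binding could look like, assuming SnakeYAML-style field mapping; the class and field names below are illustrative, not the committed design:
{code:java}
// Illustrative only: a grouped cassandra.yaml section such as
//   replica_filtering_protection:
//     cached_rows_warn_threshold: 1000
//     cached_rows_fail_threshold: 16000
// could be bound onto nested public fields, in the style of
// Cassandra's yaml-to-Config loading.
public class Config
{
    public ReplicaFilteringProtectionOptions replica_filtering_protection =
        new ReplicaFilteringProtectionOptions();

    public static class ReplicaFilteringProtectionOptions
    {
        public int cached_rows_warn_threshold = 1000;
        public int cached_rows_fail_threshold = 16000;
    }
}
{code}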
[jira] [Updated] (CASSANDRA-15924) Avoid emitting empty range tombstones from RangeTombstoneList
[ https://issues.apache.org/jira/browse/CASSANDRA-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcus Eriksson updated CASSANDRA-15924: Reviewers: Alex Petrov, Sylvain Lebresne [~ifesdjeen] & [~slebresne] do you have time to review? > Avoid emitting empty range tombstones from RangeTombstoneList > - > > Key: CASSANDRA-15924 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15924 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Coordination >Reporter: Marcus Eriksson >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.x > > > In {{RangeTombstoneList#iterator}} there is a chance we emit empty range > tombstones depending on the slice passed in. This can happen during read > repair with either an empty slice or with paging and the final page being > empty. > This creates problems in RTL if we try to insert a new range tombstone which > covers the empty ones; > {code} > Caused by: java.lang.AssertionError > at > org.apache.cassandra.db.RangeTombstoneList.insertFrom(RangeTombstoneList.java:541) > at > org.apache.cassandra.db.RangeTombstoneList.addAll(RangeTombstoneList.java:217) > at > org.apache.cassandra.db.MutableDeletionInfo.add(MutableDeletionInfo.java:141) > at > org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:137) > at org.apache.cassandra.db.Memtable.put(Memtable.java:254) > at > org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1210) > at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:573) > at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:421) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:210) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:215) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:224) > at > org.apache.cassandra.cql3.statements.ModificationStatement.executeInternalWithoutCondition(ModificationStatement.java:582) > at > org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:572) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15924) Avoid emitting empty range tombstones from RangeTombstoneList
[ https://issues.apache.org/jira/browse/CASSANDRA-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcus Eriksson updated CASSANDRA-15924: Test and Documentation Plan: new tests Status: Patch Available (was: Open) https://github.com/krummas/cassandra/commits/marcuse/15924 (also includes a fix to RowAndDeletionMergeIterator to make sure there are no other paths creating these empty tombstones) unit tests: https://circleci.com/gh/krummas/cassandra/3440 jvm dtests: https://circleci.com/gh/krummas/cassandra/3441 > Avoid emitting empty range tombstones from RangeTombstoneList > - > > Key: CASSANDRA-15924 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15924 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Coordination >Reporter: Marcus Eriksson >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.x > > > In {{RangeTombstoneList#iterator}} there is a chance we emit empty range > tombstones depending on the slice passed in. This can happen during read > repair with either an empty slice or with paging and the final page being > empty. > This creates problems in RTL if we try to insert a new range tombstone which > covers the empty ones; > {code} > Caused by: java.lang.AssertionError > at > org.apache.cassandra.db.RangeTombstoneList.insertFrom(RangeTombstoneList.java:541) > at > org.apache.cassandra.db.RangeTombstoneList.addAll(RangeTombstoneList.java:217) > at > org.apache.cassandra.db.MutableDeletionInfo.add(MutableDeletionInfo.java:141) > at > org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:137) > at org.apache.cassandra.db.Memtable.put(Memtable.java:254) > at > org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1210) > at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:573) > at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:421) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:210) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:215) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:224) > at > org.apache.cassandra.cql3.statements.ModificationStatement.executeInternalWithoutCondition(ModificationStatement.java:582) > at > org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:572) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
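To make the failure mode concrete for readers, here is a self-contained miniature of the invariant the patch enforces, with ints standing in for clustering bounds; the real code operates on slices and {{RangeTombstoneList}}, and all names below are illustrative:
{code:java}
// A tombstone intersected with a query slice must not be emitted when the
// intersection is empty: emitting it allows a later, covering insert to trip
// the RangeTombstoneList#insertFrom assertion shown in the description.
final class Range
{
    final int start, end; // half-open interval [start, end)

    Range(int start, int end) { this.start = start; this.end = end; }

    boolean isEmpty() { return start >= end; }

    /** Intersect with a slice; returns null when nothing should be emitted. */
    Range intersect(Range slice)
    {
        Range r = new Range(Math.max(start, slice.start), Math.min(end, slice.end));
        return r.isEmpty() ? null : r;
    }
}
{code}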
[jira] [Commented] (CASSANDRA-9739) Migrate counter-cache to be fully off-heap
[ https://issues.apache.org/jira/browse/CASSANDRA-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152139#comment-17152139 ] Aleksey Yeschenko commented on CASSANDRA-9739: -- Sure > Migrate counter-cache to be fully off-heap > -- > > Key: CASSANDRA-9739 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9739 > Project: Cassandra > Issue Type: Sub-task > Components: Legacy/Core >Reporter: Robert Stupp >Assignee: Robert Stupp >Priority: Normal > Fix For: 4.x > > > Counter cache still uses a concurrent map on-heap. This could go to off-heap > and feels doable now after CASSANDRA-8099. > Evaluation should be done in advance based on a POC to prove that pure > off-heap counter cache buys a performance and/or gc-pressure improvement. > In theory, elimination of on-heap management of the map should buy us some > benefit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15579) 4.0 quality testing: Distributed Read/Write Path: Coordination, Replication, and Read Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152137#comment-17152137 ] Jordan West commented on CASSANDRA-15579: - No objection to splitting. I think this was intended as a parent to sub-tasks with more specific scope. > 4.0 quality testing: Distributed Read/Write Path: Coordination, Replication, > and Read Repair > > > Key: CASSANDRA-15579 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15579 > Project: Cassandra > Issue Type: Task > Components: Test/unit >Reporter: Josh McKenzie >Assignee: Andres de la Peña >Priority: Normal > Fix For: 4.0-beta > > > Reference [doc from > NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] > for context. > *Shepherd: Blake Eggleston* > Testing in this area focuses on non-node-local aspects of the read-write > path: coordination, replication, read repair, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15579) 4.0 quality testing: Distributed Read/Write Path: Coordination, Replication, and Read Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152134#comment-17152134 ] Andres de la Peña edited comment on CASSANDRA-15579 at 7/6/20, 4:21 PM: bq. Also, just a structural nit, we could easily break this Jira into two...one dealing with coordination/replication and the other dealing with read repair. +1 to breaking it into two, that would help us to reduce the potentially vast scope of the ticket. was (Author: adelapena): > Also, just a structural nit, we could easily break this Jira into two...one > dealing with coordination/replication and the other dealing with read repair. +1 to breaking it into two, that would help us to reduce the potentially vast scope of the ticket. > 4.0 quality testing: Distributed Read/Write Path: Coordination, Replication, > and Read Repair > > > Key: CASSANDRA-15579 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15579 > Project: Cassandra > Issue Type: Task > Components: Test/unit >Reporter: Josh McKenzie >Assignee: Andres de la Peña >Priority: Normal > Fix For: 4.0-beta > > > Reference [doc from > NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] > for context. > *Shepherd: Blake Eggleston* > Testing in this area focuses on non-node-local aspects of the read-write > path: coordination, replication, read repair, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15579) 4.0 quality testing: Distributed Read/Write Path: Coordination, Replication, and Read Repair
[ https://issues.apache.org/jira/browse/CASSANDRA-15579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152125#comment-17152125 ] Caleb Rackliffe commented on CASSANDRA-15579: - Also, just a structural nit, we could easily break this Jira into two...one dealing with coordination/replication and the other dealing with read repair. > 4.0 quality testing: Distributed Read/Write Path: Coordination, Replication, > and Read Repair > > > Key: CASSANDRA-15579 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15579 > Project: Cassandra > Issue Type: Task > Components: Test/unit >Reporter: Josh McKenzie >Assignee: Andres de la Peña >Priority: Normal > Fix For: 4.0-beta > > > Reference [doc from > NGCC|https://docs.google.com/document/d/1uhUOp7wpE9ZXNDgxoCZHejHt5SO4Qw1dArZqqsJccyQ/edit#] > for context. > *Shepherd: Blake Eggleston* > Testing in this area focuses on non-node-local aspects of the read-write > path: coordination, replication, read repair, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15924) Avoid emitting empty range tombstones from RangeTombstoneList
[ https://issues.apache.org/jira/browse/CASSANDRA-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcus Eriksson updated CASSANDRA-15924: Bug Category: Parent values: Availability(12983)Level 1 values: Response Crash(12991) Complexity: Normal Component/s: Consistency/Coordination Discovered By: Unit Test Fix Version/s: 4.x 3.11.x 3.0.x Severity: Normal Status: Open (was: Triage Needed) > Avoid emitting empty range tombstones from RangeTombstoneList > - > > Key: CASSANDRA-15924 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15924 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Coordination >Reporter: Marcus Eriksson >Assignee: Marcus Eriksson >Priority: Normal > Fix For: 3.0.x, 3.11.x, 4.x > > > In {{RangeTombstoneList#iterator}} there is a chance we emit empty range > tombstones depending on the slice passed in. This can happen during read > repair with either an empty slice or with paging and the final page being > empty. > This creates problems in RTL if we try to insert a new range tombstone which > covers the empty ones; > {code} > Caused by: java.lang.AssertionError > at > org.apache.cassandra.db.RangeTombstoneList.insertFrom(RangeTombstoneList.java:541) > at > org.apache.cassandra.db.RangeTombstoneList.addAll(RangeTombstoneList.java:217) > at > org.apache.cassandra.db.MutableDeletionInfo.add(MutableDeletionInfo.java:141) > at > org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:137) > at org.apache.cassandra.db.Memtable.put(Memtable.java:254) > at > org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1210) > at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:573) > at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:421) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:210) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:215) > at org.apache.cassandra.db.Mutation.apply(Mutation.java:224) > at > org.apache.cassandra.cql3.statements.ModificationStatement.executeInternalWithoutCondition(ModificationStatement.java:582) > at > org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:572) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-15924) Avoid emitting empty range tombstones from RangeTombstoneList
Marcus Eriksson created CASSANDRA-15924: --- Summary: Avoid emitting empty range tombstones from RangeTombstoneList Key: CASSANDRA-15924 URL: https://issues.apache.org/jira/browse/CASSANDRA-15924 Project: Cassandra Issue Type: Bug Reporter: Marcus Eriksson Assignee: Marcus Eriksson In {{RangeTombstoneList#iterator}} there is a chance we emit empty range tombstones depending on the slice passed in. This can happen during read repair with either an empty slice or with paging and the final page being empty. This creates problems in RTL if we try to insert a new range tombstone which covers the empty ones; {code} Caused by: java.lang.AssertionError at org.apache.cassandra.db.RangeTombstoneList.insertFrom(RangeTombstoneList.java:541) at org.apache.cassandra.db.RangeTombstoneList.addAll(RangeTombstoneList.java:217) at org.apache.cassandra.db.MutableDeletionInfo.add(MutableDeletionInfo.java:141) at org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:137) at org.apache.cassandra.db.Memtable.put(Memtable.java:254) at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1210) at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:573) at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:421) at org.apache.cassandra.db.Mutation.apply(Mutation.java:210) at org.apache.cassandra.db.Mutation.apply(Mutation.java:215) at org.apache.cassandra.db.Mutation.apply(Mutation.java:224) at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternalWithoutCondition(ModificationStatement.java:582) at org.apache.cassandra.cql3.statements.ModificationStatement.executeInternal(ModificationStatement.java:572) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152110#comment-17152110 ] Benedict Elliott Smith commented on CASSANDRA-15234: > There is no real urgency to fix that ticket, so if we want to go the grouping > way within the scope of that ticket we should move it to 4.X. I think this is indeed preferable to releasing an API we already expect to deprecate, however I think we're overstating the difficulty here. We haven't debated the parameter naming much at all, and we can easily land this in 4.0-beta. If [~e.dimitrova] doesn't have the time, and 4.0-beta is an acceptable window to land the work, I can take a look in a few weeks. > Standardise config and JVM parameters > - > > Key: CASSANDRA-15234 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15234 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Benedict Elliott Smith >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-alpha > > Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt > > > We have a bunch of inconsistent names and config patterns in the codebase, > both from the yamls and JVM properties. It would be nice to standardise the > naming (such as otc_ vs internode_) as well as the provision of values with > units - while maintaining perpetual backwards compatibility with the old > parameter names, of course. > For temporal units, I would propose parsing strings with suffixes of: > {code} > u|micros(econds?)? > ms|millis(econds?)? > s(econds?)? > m(inutes?)? > h(ours?)? > d(ays?)? > mo(nths?)? > {code} > For rate units, I would propose parsing any of the standard {{B/s, KiB/s, > MiB/s, GiB/s, TiB/s}}. > Perhaps for avoiding ambiguity we could choose not to accept bauds {{bs, Mbps}} or > powers of 1000 such as {{KB/s}}, given these are regularly used for either > their old or new definition e.g. {{KiB/s}}, or we could support them and > simply log the value in bytes/s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
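As an aside for readers, the temporal-suffix grammar proposed in the ticket description above is small enough to sketch. The following is a simplified, hedged illustration covering only a subset of the proposed suffixes; {{DurationSpec}} and {{toMillis}} are invented names, not the API under review:
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class DurationSpec
{
    // A subset of the proposed grammar: an integer quantity followed by a unit.
    private static final Pattern PATTERN = Pattern.compile("^(\\d+)\\s*(us|ms|s|m|h|d)$");

    public static long toMillis(String value)
    {
        Matcher m = PATTERN.matcher(value.trim().toLowerCase());
        if (!m.matches())
            throw new IllegalArgumentException("invalid duration: " + value);
        long quantity = Long.parseLong(m.group(1));
        switch (m.group(2))
        {
            case "us": return TimeUnit.MICROSECONDS.toMillis(quantity);
            case "ms": return quantity;
            case "s":  return TimeUnit.SECONDS.toMillis(quantity);
            case "m":  return TimeUnit.MINUTES.toMillis(quantity);
            case "h":  return TimeUnit.HOURS.toMillis(quantity);
            case "d":  return TimeUnit.DAYS.toMillis(quantity);
            default:   throw new AssertionError();
        }
    }
}
{code}
For example, {{DurationSpec.toMillis("10s")}} yields {{10000}}.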
[jira] [Commented] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152103#comment-17152103 ] Benedict Elliott Smith commented on CASSANDRA-15234: > We have un-tested beliefs about a potentially superior design We don't ordinarily label our judgements about design "un-tested beliefs" and I think it would help to avoid this kind of rhetoric. If we all start labelling design decisions in this way the project might grind to a halt. I have anyway tried specifically to sidestep this kind of accusation, by leaving the ball in your court. I am simply asking those pushing to move ahead with the current proposal to endorse the view that it is superior. This is a very weak criterion to meet, and involves no beliefs external to yourselves. > Standardise config and JVM parameters > - > > Key: CASSANDRA-15234 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15234 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Benedict Elliott Smith >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-alpha > > Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt > > > We have a bunch of inconsistent names and config patterns in the codebase, > both from the yamls and JVM properties. It would be nice to standardise the > naming (such as otc_ vs internode_) as well as the provision of values with > units - while maintaining perpetual backwards compatibility with the old > parameter names, of course. > For temporal units, I would propose parsing strings with suffixes of: > {code} > u|micros(econds?)? > ms|millis(econds?)? > s(econds?)? > m(inutes?)? > h(ours?)? > d(ays?)? > mo(nths?)? > {code} > For rate units, I would propose parsing any of the standard {{B/s, KiB/s, > MiB/s, GiB/s, TiB/s}}. > Perhaps for avoiding ambiguity we could choose not to accept bauds {{bs, Mbps}} or > powers of 1000 such as {{KB/s}}, given these are regularly used for either > their old or new definition e.g. {{KiB/s}}, or we could support them and > simply log the value in bytes/s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152102#comment-17152102 ] Benjamin Lerer commented on CASSANDRA-15234: Agreeing on grouping will take a significant amount of time, especially now when a lot of people are pretty busy with other tasks. There is no real urgency to fix that ticket, so if we want to go the grouping way within the scope of that ticket we should move it to 4.X. > Standardise config and JVM parameters > - > > Key: CASSANDRA-15234 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15234 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Benedict Elliott Smith >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-alpha > > Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt > > > We have a bunch of inconsistent names and config patterns in the codebase, > both from the yamls and JVM properties. It would be nice to standardise the > naming (such as otc_ vs internode_) as well as the provision of values with > units - while maintaining perpetual backwards compatibility with the old > parameter names, of course. > For temporal units, I would propose parsing strings with suffixes of: > {code} > u|micros(econds?)? > ms|millis(econds?)? > s(econds?)? > m(inutes?)? > h(ours?)? > d(ays?)? > mo(nths?)? > {code} > For rate units, I would propose parsing any of the standard {{B/s, KiB/s, > MiB/s, GiB/s, TiB/s}}. > Perhaps for avoiding ambiguity we could choose not to accept bauds {{bs, Mbps}} or > powers of 1000 such as {{KB/s}}, given these are regularly used for either > their old or new definition e.g. {{KiB/s}}, or we could support them and > simply log the value in bytes/s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15889) Debian package fails to download on Arm-based hosts
[ https://issues.apache.org/jira/browse/CASSANDRA-15889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152071#comment-17152071 ] Matt Davis commented on CASSANDRA-15889: Just checking if there's been any movement here, thanks! > Debian package fails to download on Arm-based hosts > --- > > Key: CASSANDRA-15889 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15889 > Project: Cassandra > Issue Type: Bug >Reporter: Matt Davis >Priority: Normal > > Following the first three steps of the [Debian install > process|https://cassandra.apache.org/download/], after an apt-get update > you'll see this line: > {code:bash} > $ sudo apt-get update > ... > N: Skipping acquire of configured file 'main/binary-arm64/Packages' as > repository 'https://downloads.apache.org/cassandra/debian 311x InRelease' > doesn't support architecture 'arm64' > {code} > Checking the [Debian > repo|https://dl.bintray.com/apache/cassandra/dists/311x/main/] confirms there > is no aarch64 variant available. > Should you then attempt to install Cassandra: > {code:bash} > $ sudo apt-get install cassandra > Reading package lists... Done > Building dependency tree > Reading state information... Done > Package cassandra is not available, but is referred to by another package. > This may mean that the package is missing, has been obsoleted, or > is only available from another source > E: Package 'cassandra' has no installation candidate > {code} > The Redhat RPM contains a "noarch" arch type, so it will download on any > host. (Cassandra does not use separate binaries/releases for different > architectures, so this seems to be the correct approach, but adding an > aarch64 variant would also suffice.) > Note that there is a workaround available: if you specify "amd64" as the arch > for the source, it downloads and runs on Arm without issue: > {code:bash} > deb [arch=amd64] https://downloads.apache.org/cassandra/debian 311x main > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-9555) Don't let offline tools run while cassandra is running
[ https://issues.apache.org/jira/browse/CASSANDRA-9555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Stupp reassigned CASSANDRA-9555: --- Assignee: (was: Robert Stupp) > Don't let offline tools run while cassandra is running > -- > > Key: CASSANDRA-9555 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9555 > Project: Cassandra > Issue Type: Improvement > Components: Legacy/Tools >Reporter: Marcus Eriksson >Priority: Low > Fix For: 4.x > > > We should not let offline tools that modify sstables run while Cassandra is > running. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-13838) Ensure FastThreadLocal.removeAll() is called for all threads
[ https://issues.apache.org/jira/browse/CASSANDRA-13838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Stupp updated CASSANDRA-13838: - Status: Open (was: Patch Available) > Ensure FastThreadLocal.removeAll() is called for all threads > > > Key: CASSANDRA-13838 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13838 > Project: Cassandra > Issue Type: Improvement > Components: Legacy/Core >Reporter: Robert Stupp >Assignee: Robert Stupp >Priority: Normal > > There are a couple of places where it's not guaranteed that > FastThreadLocal.removeAll() is called. Most misses are actually not that > critical, but the miss for the threads created via > org.apache.cassandra.streaming.ConnectionHandler.MessageHandler#start(java.net.Socket, > int, boolean) could be critical, because these threads are created for every > stream-session. > (Follow-up from CASSANDRA-13754) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
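The guarantee the ticket asks for can be sketched with a small wrapper that cleans up on any exit path; {{withFastThreadLocalCleanup}} is a name invented for this sketch, while {{FastThreadLocal.removeAll()}} is Netty's real API:
{code:java}
import io.netty.util.concurrent.FastThreadLocal;

public final class ThreadLocalCleanup
{
    // Wrap a task so FastThreadLocal state is dropped when the thread's work
    // finishes, whichever path (normal or exceptional) it exits through.
    public static Runnable withFastThreadLocalCleanup(Runnable task)
    {
        return () -> {
            try
            {
                task.run();
            }
            finally
            {
                FastThreadLocal.removeAll();
            }
        };
    }
}
{code}
For the streaming MessageHandler case mentioned above, the thread's run loop would be wrapped this way before the thread is started.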
[jira] [Commented] (CASSANDRA-9739) Migrate counter-cache to be fully off-heap
[ https://issues.apache.org/jira/browse/CASSANDRA-9739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152061#comment-17152061 ] Robert Stupp commented on CASSANDRA-9739: - [~aleksey] do you mind if I close this one? > Migrate counter-cache to be fully off-heap > -- > > Key: CASSANDRA-9739 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9739 > Project: Cassandra > Issue Type: Sub-task > Components: Legacy/Core >Reporter: Robert Stupp >Assignee: Robert Stupp >Priority: Normal > Fix For: 4.x > > > Counter cache still uses a concurrent map on-heap. This could go to off-heap > and feels doable now after CASSANDRA-8099. > Evaluation should be done in advance based on a POC to prove that pure > off-heap counter cache buys a performance and/or gc-pressure improvement. > In theory, elimination of on-heap management of the map should buy us some > benefit. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-9454) Log WARN on Multi Partition IN clause Queries
[ https://issues.apache.org/jira/browse/CASSANDRA-9454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Stupp updated CASSANDRA-9454: Reviewers: (was: Robert Stupp) > Log WARN on Multi Partition IN clause Queries > - > > Key: CASSANDRA-9454 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9454 > Project: Cassandra > Issue Type: New Feature > Components: Legacy/CQL >Reporter: Sebastian Estevez >Assignee: T Jake Luciani >Priority: Low > Fix For: 2.2.x > > > Similar to CASSANDRA-6487 but for multi-partition queries. > Show a warning (ideally at the client, CASSANDRA-8930) when users try to use IN > clauses when clustering columns span multiple partitions. The right way to go > is async requests per partition. > **Update**: Unless the query is CL.ONE and all the partition ranges are on > the node! In which case multi-partition IN is okay. > This can cause an OOM > {code} > ERROR [Thread-388] 2015-05-18 12:11:10,147 CassandraDaemon.java (line 199) > Exception in thread Thread[Thread-388,5,main] > java.lang.OutOfMemoryError: Java heap space > ERROR [ReadStage:321] 2015-05-18 12:11:10,147 CassandraDaemon.java (line 199) > Exception in thread Thread[ReadStage:321,5,main] > java.lang.OutOfMemoryError: Java heap space > at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) > at java.nio.ByteBuffer.allocate(ByteBuffer.java:331) > at > org.apache.cassandra.io.util.MappedFileDataInput.readBytes(MappedFileDataInput.java:146) > at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:392) > at > org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:371) > at > org.apache.cassandra.io.sstable.IndexHelper$IndexInfo.deserialize(IndexHelper.java:187) > at > org.apache.cassandra.db.RowIndexEntry$Serializer.deserialize(RowIndexEntry.java:122) > at > org.apache.cassandra.io.sstable.SSTableReader.getPosition(SSTableReader.java:970) > at > org.apache.cassandra.io.sstable.SSTableReader.getPosition(SSTableReader.java:871) > at > org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:41) > at > org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:167) > at > org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:62) > at > org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:250) > at > org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:53) > at > org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1547) > at > org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1376) > at org.apache.cassandra.db.Keyspace.getRow(Keyspace.java:327) > at > org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:65) > at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:47) > at > org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > {code} > By flooding heap with: > {code}org.apache.cassandra.io.sstable.IndexHelper$IndexInfo{code} > taken from: > http://stackoverflow.com/questions/30366729/out-of-memory-error-in-cassandra-when-querying-big-rows-containing-a-collection -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
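For readers, the "async requests per partition" pattern recommended in the description looks roughly like this with the DataStax Java driver 3.x; the keyspace, table, and column names are invented for the sketch:
{code:java}
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;

public final class PerPartitionReads
{
    // Instead of SELECT ... WHERE pk IN (k1, k2, ...), issue one async read
    // per partition key and collect the results; each read can then be served
    // by the replicas owning that partition rather than a single coordinator.
    public static List<ResultSet> fetch(Session session, List<Object> keys)
    {
        PreparedStatement ps = session.prepare("SELECT * FROM ks.tbl WHERE pk = ?");
        List<ResultSetFuture> futures = new ArrayList<>();
        for (Object key : keys)
            futures.add(session.executeAsync(ps.bind(key)));

        List<ResultSet> results = new ArrayList<>();
        for (ResultSetFuture f : futures)
            results.add(f.getUninterruptibly());
        return results;
    }
}
{code}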
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Stupp updated CASSANDRA-15922: - Reviewers: Benedict Elliott Smith, Robert Stupp, Robert Stupp (was: Benedict Elliott Smith, Robert Stupp) Status: Review In Progress (was: Patch Available) > High CAS failures in NativeAllocator.Region.allocate(..) > - > > Key: CASSANDRA-15922 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15922 > Project: Cassandra > Issue Type: Bug > Components: Local/Memtable >Reporter: Michael Semb Wever >Assignee: Michael Semb Wever >Priority: Normal > Fix For: 4.0, 3.0.x, 3.11.x > > Attachments: NativeAllocatorRegion2Test.java, > NativeAllocatorRegionTest.java, Screen Shot 2020-07-05 at 13.16.10.png, > Screen Shot 2020-07-05 at 13.26.17.png, Screen Shot 2020-07-05 at > 13.35.55.png, Screen Shot 2020-07-05 at 13.37.01.png, Screen Shot 2020-07-05 > at 13.48.16.png, Screen Shot 2020-07-06 at 11.35.35.png, Screen Shot > 2020-07-06 at 11.36.44.png, Screen Shot 2020-07-06 at 13.26.10.png, > profile_pbdpc23zafsrh_20200702.svg > > > h4. Problem > The method {{NativeAllocator.Region.allocate(..)}} uses an {{AtomicInteger}} > for the current offset in the region. Allocation depends on a > {{.compareAndSet(..)}} call. > In highly contended environments the CAS failures can be high, starving > writes in a running Cassandra node. > h4. Example > It has been witnessed that up to 33% of CPU time was stuck in the > {{NativeAllocator.Region.allocate(..)}} loop (due to the CAS failures) during > a heavy Spark analytics write load. > These nodes (40 CPU cores and 256GB RAM) have the relevant settings > - {{memtable_allocation_type: offheap_objects}} > - {{memtable_offheap_space_in_mb: 5120}} > - {{concurrent_writes: 160}} > Numerous flamegraphs demonstrate the problem. See attached > [^profile_pbdpc23zafsrh_20200702.svg]. > h4. Suggestion: ThreadLocal Regions > One possible solution is to have separate Regions per thread. > Code wise this is relatively easy to do, for example replacing > NativeAllocator:59 > {code}private final AtomicReference<Region> currentRegion = new > AtomicReference<>();{code} > with > {code}private final ThreadLocal<AtomicReference<Region>> currentRegion = new > ThreadLocal<>() {...};{code} > But this approach substantially changes the allocation behaviour, with more > than concurrent_writes number of Regions in use at any one time. For example > with {{concurrent_writes: 160}} that's 160+ regions, each of 1MB. > h4. Suggestion: Simple Contention Management Algorithm (Constant Backoff) > Another possible solution is to introduce a contention management algorithm > that a) reduces CAS failures in high contention environments, b) doesn't > impact normal environments, and c) keeps the allocation strategy of using one > region at a time. > The research paper [arXiv:1305.5800|https://arxiv.org/abs/1305.5800] > describes this contention CAS problem and demonstrates a number of algorithms > to apply. The simplest of these algorithms is the Constant Backoff CAS > Algorithm. > Applying the Constant Backoff CAS Algorithm involves adding one line of code > to {{NativeAllocator.Region.allocate(..)}} to sleep for one (or some constant > number of) nanoseconds after a CAS failure occurs. > That is... > {code} > // we raced and lost alloc, try again > LockSupport.parkNanos(1); > {code} > h4.
Constant Backoff CAS Algorithm Experiments > Using the code attached in NativeAllocatorRegionTest.java the concurrency and > CAS failures of {{NativeAllocator.Region.allocate(..)}} can be demonstrated. > In the attached [^NativeAllocatorRegionTest.java] class, which can be run > standalone, the {{Region}} class, copied from {{NativeAllocator.Region}}, also > has the {{casFailures}} field added. The following two screenshots are from > data collected from this class on a 6 CPU (12 core) MBP, running the > {{NativeAllocatorRegionTest.testRegionCAS}} method. > This attached screenshot shows the number of CAS failures during the life of > a Region (over ~215 million allocations), using different threads and park > times. This illustrates the improvement (reduction) of CAS failures from zero > park time, through orders of magnitude, up to 10000000ns (10ms). The biggest > improvement is from no algorithm to a park time of 1ns where CAS failures are > ~two orders of magnitude lower. From a park time of 10μs and higher there is a > significant drop also at low contention rates. > !Screen Shot 2020-07-05 at 13.16.10.png|width=500px! > This attached screenshot shows the time it takes to fill a Region (~215 > million allocations), using different threads and park times. The biggest > improvement is from no algorithm to a park time of 1ns where performance is > one order of magnitude faster.
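The constant-backoff variant is small enough to show standalone. The following mirrors the shape of {{NativeAllocator.Region.allocate(..)}} with the one-line backoff applied; the class skeleton and sizes are illustrative, not the attached test code:
{code:java}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

final class Region
{
    private final AtomicInteger nextFreeOffset = new AtomicInteger(0);
    private final int capacity;

    Region(int capacity) { this.capacity = capacity; }

    /** @return the allocated offset, or -1 when the region is exhausted. */
    int allocate(int size)
    {
        while (true)
        {
            int oldOffset = nextFreeOffset.get();
            if (oldOffset + size > capacity)
                return -1; // caller swaps in a fresh region

            if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
                return oldOffset;

            // we raced and lost alloc, try again -- with constant backoff
            LockSupport.parkNanos(1);
        }
    }
}
{code}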
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Stupp updated CASSANDRA-15922: - Reviewers: Benedict Elliott Smith, Robert Stupp
[jira] [Assigned] (CASSANDRA-15923) Collection types written via prepared statement not checked for nulls
[ https://issues.apache.org/jira/browse/CASSANDRA-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andres de la Peña reassigned CASSANDRA-15923: - Assignee: Andres de la Peña > Collection types written via prepared statement not checked for nulls > - > > Key: CASSANDRA-15923 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15923 > Project: Cassandra > Issue Type: Bug > Components: Messaging/Client >Reporter: Tom van der Woerdt >Assignee: Andres de la Peña >Priority: Normal > > To reproduce: > {code:java} > >>> cluster = Cluster() > >>> session = cluster.connect() > >>> session.execute("create keyspace frozen_int_test with replication = > >>> {'class': 'SimpleStrategy', 'replication_factor': 1}") > >>> session.execute("create table frozen_int_test.mytable (id int primary > >>> key, value frozen<list<int>>)") > >>> session.execute(session.prepare("insert into frozen_int_test.mytable (id, > >>> value) values (?, ?)"), (1, [1,2,3])) > >>> list(session.execute("select * from frozen_int_test.mytable")) > [Row(id=1, value=[1, 2, 3])] > >>> session.execute(session.prepare("insert into frozen_int_test.mytable (id, > >>> value) values (?, ?)"), (1, [1,2,None])) > >>> list(session.execute("select * from frozen_int_test.mytable")) > [Row(id=1, value=[1, 2, None])] {code} > Now you might say "But Tom, that just shows that it works!", but this does > not work as a CQL literal: > {code:java} > >>> session.execute("insert into frozen_int_test.mytable (id, value) values > >>> (1, [1,2,null])") > [...] cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] > message="null is not supported inside collections" {code} > Worse, if a mutation like this makes its way into the hints, it will be > retried indefinitely as it fails validation with a NullPointerException: > {code:java} > ERROR [MutationStage-11] 2020-07-06 09:23:25,696 > AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread > Thread[MutationStage-11,5,main] > java.lang.NullPointerException: null > at > org.apache.cassandra.serializers.Int32Serializer.validate(Int32Serializer.java:41) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.serializers.ListSerializer.validateForNativeProtocol(ListSerializer.java:70) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.serializers.CollectionSerializer.validate(CollectionSerializer.java:56) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.db.marshal.AbstractType.validate(AbstractType.java:162) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.db.marshal.AbstractType.validateCellValue(AbstractType.java:196) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.db.marshal.CollectionType.validateCellValue(CollectionType.java:124) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.config.ColumnDefinition.validateCell(ColumnDefinition.java:410) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.db.rows.AbstractCell.validate(AbstractCell.java:154) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.db.partitions.PartitionUpdate.validate(PartitionUpdate.java:486) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at java.util.Collections$SingletonSet.forEach(Collections.java:4769) > ~[na:1.8.0_252] > at > org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:69) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) > ~[apache-cassandra-3.11.6.jar:3.11.6]
> at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > ~[na:1.8.0_252] > at > org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:165) > ~[apache-cassandra-3.11.6.jar:3.11.6] > at > org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:137) > [apache-cassandra-3.11.6.jar:3.11.6] > at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:113) > [apache-cassandra-3.11.6.jar:3.11.6] > at java.lang.Thread.run(Thread.java:748) [na:1.8.0_252] {code} > A similar problem is reproducible when writing into a non-frozen column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
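For contrast with the {{NullPointerException}} above, here is a minimal sketch of the kind of element-level null check that CQL literal parsing performs and that the prepared-statement path skips. {{CollectionNullCheck.validateNoNulls}} is a hypothetical helper invented for this example, not the actual Cassandra validation code.
{code}
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

final class CollectionNullCheck
{
    /** Rejects null elements in a bound collection value before it is written. */
    static void validateNoNulls(List<ByteBuffer> elements)
    {
        for (ByteBuffer element : elements)
            if (element == null)
                throw new IllegalArgumentException("null is not supported inside collections");
    }

    public static void main(String[] args)
    {
        try
        {
            // second element is null, mirroring the [1,2,None] bind above
            validateNoNulls(Arrays.asList(ByteBuffer.allocate(4), null));
        }
        catch (IllegalArgumentException e)
        {
            System.out.println(e.getMessage()); // null is not supported inside collections
        }
    }
}
{code}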
[jira] [Commented] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152050#comment-17152050 ] Josh McKenzie commented on CASSANDRA-15234: --- {quote}It is customary that, before work is committed, alternative proposals are engaged with on their technical merits. It occurs to me we recently worked to mandate this as part of the process, in fact. In this case we _seem_ in danger of subordinating this to beliefs about scheduling.{quote} Unfortunately it's customary on this project to do that right up until the last moment before something is committed (and even beyond), with no weighting of the value of things actually being in the hands of users vs. sitting in-tree and unreleased. We have un-tested beliefs about a potentially superior design set against un-tested beliefs about the negative impact of further delay on the project. This is not a situation in which we can expect to make progress on the discussion until and unless both sides collect some empirical evidence for their position, as well as spend real time investigating and exploring the positions of the other people engaged. Unfortunately I'm well past the time I personally have available to engage on this ticket; I'll defer to other people to take it from here. > Standardise config and JVM parameters > - > > Key: CASSANDRA-15234 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15234 > Project: Cassandra > Issue Type: Bug > Components: Local/Config >Reporter: Benedict Elliott Smith >Assignee: Ekaterina Dimitrova >Priority: Normal > Fix For: 4.0-alpha > > Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt > > > We have a bunch of inconsistent names and config patterns in the codebase, > both from the yamls and JVM properties. It would be nice to standardise the > naming (such as otc_ vs internode_) as well as the provision of values with > units - while maintaining perpetual backwards compatibility with the old > parameter names, of course. > For temporal units, I would propose parsing strings with suffixes of: > {code} > u|micros(econds?)? > ms|millis(econds?)? > s(econds?)? > m(inutes?)? > h(ours?)? > d(ays?)? > mo(nths?)? > {code} > For rate units, I would propose parsing any of the standard {{B/s, KiB/s, > MiB/s, GiB/s, TiB/s}}. > Perhaps to avoid ambiguity we could not accept bauds {{bs, Mbps}} or powers > of 1000 such as {{KB/s}}, given these are regularly used for either their old > or new definition e.g. {{KiB/s}}, or we could support them and simply log the > value in bytes/s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
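As an illustration of the temporal suffixes proposed above, here is a small sketch of a parser that accepts them. The pattern follows the ticket's list, while the class and method names ({{DurationParser.parseDurationMillis}}) are invented for the example; months are approximated as 30 days and sub-millisecond values truncate toward zero.
{code}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DurationParser
{
    private static final Pattern PATTERN = Pattern.compile(
        "(\\d+)\\s*(u|micros(?:econds?)?|ms|millis(?:econds?)?|s(?:econds?)?" +
        "|m(?:inutes?)?|h(?:ours?)?|d(?:ays?)?|mo(?:nths?)?)");

    static long parseDurationMillis(String value)
    {
        Matcher m = PATTERN.matcher(value.trim());
        if (!m.matches())
            throw new IllegalArgumentException("cannot parse duration: " + value);

        long n = Long.parseLong(m.group(1));
        String unit = m.group(2);
        // order matters: check "mo", "ms"/"milli" and "micro" before plain "m" (minutes)
        if (unit.startsWith("mo"))                         return n * 30L * 24 * 60 * 60 * 1000;
        if (unit.equals("ms") || unit.startsWith("milli")) return n;
        if (unit.equals("u")  || unit.startsWith("micro")) return n / 1000; // truncates below 1ms
        if (unit.startsWith("m"))                          return n * 60_000L;
        if (unit.startsWith("s"))                          return n * 1000L;
        if (unit.startsWith("h"))                          return n * 3_600_000L;
        if (unit.startsWith("d"))                          return n * 86_400_000L;
        throw new IllegalArgumentException("unknown unit: " + unit);
    }

    public static void main(String[] args)
    {
        System.out.println(parseDurationMillis("10s"));      // 10000
        System.out.println(parseDurationMillis("5minutes")); // 300000
    }
}
{code}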
[jira] [Commented] (CASSANDRA-10968) When taking snapshot, manifest.json contains incorrect or no files when column family has secondary indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-10968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152040#comment-17152040 ] Andres de la Peña commented on CASSANDRA-10968: --- The fix looks good to me. I have run CI again: ||branch||utest||dtest|| |2.1 |[167|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-test/167/]|[201|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/201/]| |2.2 |[168|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-test/168/]|[202|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/202/]| |3.0 |[169|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-test/169/]|[203|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/203/]| |3.11|[170|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-test/170/]|[204|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/204/]| |4.0 |[171|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-test/171/]|[205|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/205/]| Dtests haven't finished yet. 2.1 CI seems to have failed to build, and indeed we don't seem to have a regular CI build for it. For 2.2 there are failures in {{SSTableRewriterTest}} that also happen in the base branch, so they don't seem related. We should certainly apply the fix from 2.2 onwards. I'm not sure this is critical enough for 2.1, but the fix is quite small, so it probably won't be a problem to include that branch, although the lack of CI for it is a bit worrying. > When taking snapshot, manifest.json contains incorrect or no files when > column family has secondary indexes > --- > > Key: CASSANDRA-10968 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10968 > Project: Cassandra > Issue Type: Bug > Components: Feature/2i Index >Reporter: Fred A >Assignee: Aleksandr Sorokoumov >Priority: Normal > Labels: lhf > Fix For: 2.2.x, 3.0.x, 3.11.x > > Time Spent: 1.5h > Remaining Estimate: 0h > > Noticed indeterminate behaviour when taking snapshots on column families that > have secondary indexes set up. The manifest.json created when taking a > snapshot sometimes contains no file names at all and sometimes only some file > names. > I don't know if this post is related, but it was the only thing I could find: > http://www.mail-archive.com/user%40cassandra.apache.org/msg42019.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15406) Add command to show the progress of data streaming and index build
[ https://issues.apache.org/jira/browse/CASSANDRA-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152024#comment-17152024 ] Berenguer Blasi edited comment on CASSANDRA-15406 at 7/6/20, 1:48 PM: -- Ok, so here's an opportunity for CASSANDRA-15502, or even as part of the 4.0 quality effort: some scaffolding for cmd line tooling testing. Thx otherwise, pending CI lgtm. was (Author: bereng): Ok so here's an opportunity for CASSANDRA-15502 or even as they quality 4.0 effort. Some scaffolding for cmd line tooling testing. Thx otherwise, pending CI lgtm. > Add command to show the progress of data streaming and index build > --- > > Key: CASSANDRA-15406 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15406 > Project: Cassandra > Issue Type: Improvement > Components: Consistency/Streaming, Legacy/Streaming and Messaging, > Tool/nodetool >Reporter: maxwellguo >Assignee: Stefan Miklosovic >Priority: Normal > Fix For: 4.0, 4.x > > Time Spent: 10m > Remaining Estimate: 0h > > I found that we should supply a command to show the progress of streaming > when we do a bootstrap/move/decommission/removenode operation. When doing > data streaming, nobody knows which step the program is in, so I think a > command to show the joining/leaving node's progress is needed. > > PR [https://github.com/apache/cassandra/pull/558] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
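As a sketch of the {{System.out}} hijacking idea discussed in this thread: the nodetool invocation itself is stubbed out with a plain {{println}} (there is no real command under test here), since only the capture-and-restore pattern is the point.
{code}
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class CapturedOutputExample
{
    public static void main(String[] args)
    {
        PrintStream original = System.out;
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        System.setOut(new PrintStream(captured));
        try
        {
            // stand-in for invoking the nodetool command under test
            System.out.println("streaming progress: 42%");
        }
        finally
        {
            System.setOut(original); // always restore stdout, even on failure
        }

        if (!captured.toString().contains("progress"))
            throw new AssertionError("expected progress output");
    }
}
{code}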
[jira] [Commented] (CASSANDRA-15406) Add command to show the progress of data streaming and index build
[ https://issues.apache.org/jira/browse/CASSANDRA-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152024#comment-17152024 ] Berenguer Blasi commented on CASSANDRA-15406: - Ok so here's an opportunity for CASSANDRA-15502 or even as they quality 4.0 effort. Some scaffolding for cmd line tooling testing. Thx otherwise, pending CI lgtm. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15406) Add command to show the progress of data streaming and index build
[ https://issues.apache.org/jira/browse/CASSANDRA-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152011#comment-17152011 ] Berenguer Blasi edited comment on CASSANDRA-15406 at 7/6/20, 1:42 PM: -- Would it make sense to add some testing either in a junit hijacking {{System.out}} and mocking things around in {{NodeToolTest.java}} i.e. Or even in {{nodetool_test.py}} dtest? I am aware the nodetool tool doesn't have this sort of testing atm (only some dtests) so we can use this as an opportunity to kickstart now. You can also tell me you'd rather do this in another ticket as it's not a quick fix :-) depending on how OCD on testing you feel on this one. was (Author: bereng): Would it make sense to add some testing either in a junit hijacking {{System.out}} and mocking things around in {{NodeToolTest.java}} i.e. Or even in {{nodetool_test.py}} dtest. I am aware the nodetool tool doesn't have this sort of testing atm (only some dtests) so we can use this as an opportunity to kickstart now. You can also tell me you'd rather do this in another ticket as it's not a quick fix :-) depending on how OCD on testing you feel on this one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151993#comment-17151993 ] Michael Semb Wever edited comment on CASSANDRA-15922 at 7/6/20, 1:23 PM: - Patch updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values in the {{toString(..)}} method (when the region is full and nextFreeOffset is past capacity) https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1 CI run at https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-devbranch/detail/Cassandra-devbranch/197/pipeline was (Author: michaelsembwever): Patched updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values in the {{toString(..)}} method (when region is full and nextFreeOffset is passed capacity) https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1 CI run at https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-devbranch/detail/Cassandra-devbranch/197/pipeline
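For readers unfamiliar with the {{getAndAdd(..)}} vs {{addAndGet(..)}} distinction referenced in the patch notes above, a small stand-alone illustration (not the patch itself): {{getAndAdd}} returns the previous value, which for an allocator is the allocation's own start offset, so no subtraction is needed.
{code}
import java.util.concurrent.atomic.AtomicInteger;

public class OffsetDemo
{
    public static void main(String[] args)
    {
        AtomicInteger nextFreeOffset = new AtomicInteger(0);
        int size = 64;

        // addAndGet returns the *new* offset; the allocation start must be derived
        int end = nextFreeOffset.addAndGet(size);
        int start = end - size;                      // 0

        // getAndAdd returns the *previous* offset, i.e. the allocation start directly
        int start2 = nextFreeOffset.getAndAdd(size); // 64 (the second allocation)

        System.out.println(start + " " + start2);    // prints: 0 64
    }
}
{code}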
[jira] [Commented] (CASSANDRA-15406) Add command to show the progress of data streaming and index build
[ https://issues.apache.org/jira/browse/CASSANDRA-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152014#comment-17152014 ] Stefan Miklosovic commented on CASSANDRA-15406: --- My OCD here is pretty weak, I would just move it to another PR. I was talking about testing of nodetool's output with [~ifesdjeen] and how we could go about that in a testing framework, but no luck so far ... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15406) Add command to show the progress of data streaming and index build
[ https://issues.apache.org/jira/browse/CASSANDRA-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152011#comment-17152011 ] Berenguer Blasi commented on CASSANDRA-15406: - Would it make sense to add some testing either in a junit hijacking {{System.out}} and mocking things around in {{NodeToolTest.java}} i.e. Or even in {{nodetool_test.py}} dtest. I am aware the nodetool tool doesn't have this sort of testing atm (only some dtests) so we can use this as an opportunity to kickstart now. You can also tell me you'd rather do this in another ticket as it's not a quick fix :-) depending on how OCD on testing you feel on this one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152010#comment-17152010 ] Benedict Elliott Smith commented on CASSANDRA-15922: +1
[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152006#comment-17152006 ] Robert Stupp commented on CASSANDRA-15922: -- +1 (assuming CI looks good and 3.11+3.0 back-ports are clean)
[jira] [Comment Edited] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151993#comment-17151993 ] Michael Semb Wever edited comment on CASSANDRA-15922 at 7/6/20, 12:51 PM: -- Patched updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values in the {{toString(..)}} method (when region is full and nextFreeOffset is passed capacity) https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1 CI run at https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-devbranch/detail/Cassandra-devbranch/197/pipeline was (Author: michaelsembwever): Patched updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values the {{toString(..)}} method (when region is full and nextFreeOffset is passed capacity) https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1 CI run at https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-devbranch/detail/Cassandra-devbranch/197/pipeline
[jira] [Comment Edited] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151993#comment-17151993 ] Michael Semb Wever edited comment on CASSANDRA-15922 at 7/6/20, 12:51 PM: -- Patched updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values the {{toString(..)}} method (when region is full and nextFreeOffset is passed capacity) https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1 CI run at https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-devbranch/detail/Cassandra-devbranch/197/pipeline was (Author: michaelsembwever): Patched updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values the {{toString(..)}} method (when region is full and nextFreeOffset is passed capacity) https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1
[jira] [Comment Edited] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151993#comment-17151993 ] Michael Semb Wever edited comment on CASSANDRA-15922 at 7/6/20, 12:42 PM: -- Patched updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values the {{toString(..)}} method (when region is full and nextFreeOffset is passed capacity) https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1 was (Author: michaelsembwever): Patched updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values the {{toString(..)}} method (when region is full and nextFreeOffset is passed capacity)
[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151993#comment-17151993 ] Michael Semb Wever commented on CASSANDRA-15922: Patched updated to - use {{getAndAdd(..)}} instead of {{addAndGet(..)}} for readability - remove the {{allocCount}} AtomicInteger field - don't print negative waste values the {{toString(..)}} method (when region is full and nextFreeOffset is passed capacity)
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Test and Documentation Plan: existing CI. benchmarking in ticket. (was: existing CI.) Status: Patch Available (was: In Progress)
[jira] [Comment Edited] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151970#comment-17151970 ] Michael Semb Wever edited comment on CASSANDRA-15922 at 7/6/20, 12:04 PM:

h4. {{addAndGet}} Experiments
Code patch at https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1
This patch depends on the {{addAndGet(..)}} call guaranteeing a (serial) result that returns the value from no overlapping/later add calls. AFAIK that is how AtomicInteger works.
I'm also curious whether we still need the {{allocCount}} AtomicInteger field; it appears to be there only for debugging. May I remove it in this patch?
Benchmark code attached in [^NativeAllocatorRegion2Test.java]. The following attached screenshot shows the time it takes to fill a Region (~215 million allocations), using different threads, comparing the original code (compareAndSet), the addAndGet, and the constant backoff (parkNano) approaches. The biggest improvement is still the constant backoff algorithm, where performance is one order of magnitude faster. But the addAndGet approach is 2x to 5x faster than the original, and as mentioned above it also comes with the benefit of no loop (no starvation) and faster performance in all workloads.
!Screen Shot 2020-07-06 at 13.26.10.png|width=600px!
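For reference, a minimal sketch (assumed shape, not the exact diff in the patch) of the loop-free allocation described above: the offset is bumped unconditionally, and success is decided in the aftermath.

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

class Region
{
    private final long peer;      // base address of the off-heap region
    private final int capacity;   // region size in bytes
    private final AtomicInteger nextFreeOffset = new AtomicInteger(0);

    Region(long peer, int capacity) { this.peer = peer; this.capacity = capacity; }

    /** Loop-free allocation: claim bytes unconditionally, check in the aftermath. */
    long allocate(int size)
    {
        // getAndAdd(..) returns the previous offset, i.e. the start of the
        // claimed range, so no retry loop (and no starvation) is possible.
        int oldOffset = nextFreeOffset.getAndAdd(size);
        if (oldOffset + size <= capacity)
            return peer + oldOffset; // success: bytes [oldOffset, oldOffset + size) are ours

        // Overshot: the region is full for an allocation of this size. The
        // caller swaps in a new Region, so the unused tail is simply wasted.
        return -1;
    }
}
{code}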
[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151970#comment-17151970 ] Michael Semb Wever commented on CASSANDRA-15922:

h4. {{addAndGet}} Experiments
Code patch at https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_15922_1
This patch depends on the {{addAndGet(..)}} call guaranteeing a (serial) result that returns the value from no overlapping/later add calls. AFAIK that is how AtomicInteger works.
I'm also curious whether we still need the {{allocCount}} AtomicInteger field; it appears to be there only for debugging. May I remove it in this patch?
Benchmark code attached in [^NativeAllocatorRegion2Test.java]. The following attached screenshot shows the time it takes to fill a Region (~215 million allocations), using different threads, comparing the original code (compareAndSet), the addAndGet, and the constant backoff (parkNano) approaches. The biggest improvement is still the algorithm with a park time of 1ns, where performance is one order of magnitude faster. The addAndGet approach is 2x to 5x faster than the original. As mentioned above it also comes with the benefit of no loop (no starvation) and faster performance in all workloads.
!Screen Shot 2020-07-06 at 13.26.10.png|width=600px!
[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151969#comment-17151969 ] Robert Stupp commented on CASSANDRA-15922:

+1 on {{addAndGet}} (or {{getAndAdd}}, whichever works best). And I agree, the allocation model that we currently have is not great, but as you said, it's a ton of work to get it right (less (ideally no) fragmentation, no unnecessary tiny allocations, no unnecessary copying, etc etc).
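For reference on the "whichever works best" remark: the two calls perform the same atomic increment and differ only in which value they return, so either is correct here. A minimal illustration:

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

class GetAndAddVsAddAndGet
{
    public static void main(String[] args)
    {
        AtomicInteger offset = new AtomicInteger(0);

        int previous = offset.getAndAdd(8); // returns 0; offset is now 8
        int updated  = offset.addAndGet(8); // returns 16; offset is now 16

        // getAndAdd reads more naturally for an allocator, since the
        // claimed range starts at the previous offset.
        System.out.println(previous + " " + updated);
    }
}
{code}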
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Attachment: NativeAllocatorRegion2Test.java
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Attachment: Screen Shot 2020-07-06 at 13.26.10.png
[jira] [Comment Edited] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151952#comment-17151952 ] Michael Semb Wever edited comment on CASSANDRA-15922 at 7/6/20, 10:53 AM:

bq. I assume in this case one of the problems is that we are allocating huge numbers of small objects, so that a small number of threads are competing over-and-over again to allocate the same data. We should not be competing for each Cell allocation, and instead try to allocate all the buffers for e.g. at least a Row at once.
This is correct. Rows with ~many hundreds of double cells.
bq. There is perhaps a better alternative: use addAndGet->if instead of read->if->compareAndSet, i.e. unconditionally update the pointer, then determine whether or not you successfully allocated in the aftermath. This is guaranteed to succeed in one step; contention can slow that step down modestly, but there is no wasted competition.
Sounds good. Will put it together and test.
[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151952#comment-17151952 ] Michael Semb Wever commented on CASSANDRA-15922:

bq. I assume in this case one of the problems is that we are allocating huge numbers of small objects, so that a small number of threads are competing over-and-over again to allocate the same data. We should not be competing for each Cell allocation, and instead try to allocate all the buffers for e.g. at least a Row at once.
This is correct. Rows with ~thousands of double cells.
bq. There is perhaps a better alternative: use addAndGet->if instead of read->if->compareAndSet, i.e. unconditionally update the pointer, then determine whether or not you successfully allocated in the aftermath. This is guaranteed to succeed in one step; contention can slow that step down modestly, but there is no wasted competition.
Sounds good. Will put it together and test.
[jira] [Commented] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151948#comment-17151948 ] Benedict Elliott Smith commented on CASSANDRA-15234:

bq. The valid concern around the api churn it introduces is addressed by committing the new ticket to 4.0-beta.
This ticket's API replaces the current API, is mutually exclusive with the alternative proposal, and would be deprecated by it. If we introduce them both in 4.0-beta, we must maintain them both and go through the full deprecation process. So unfortunately no churn is avoided.
> I'd be curious what your perspective is on how we determine what qualifies as justified
It is customary that, before work is committed, alternative proposals are engaged with on their technical merits. It occurs to me we recently worked to mandate this as part of the process, in fact. In this case we _seem_ in danger of subordinating this to beliefs about scheduling.
If you like, I can formulate a legally airtight veto, but my goal is only for you to engage briefly with the proposal and determine for yourselves which is superior. If the new proposal is _technically_* superior, and of similar complexity, then you are my justification. If you disagree, however - and importantly we agree that we do not intend to pursue the alternative approach in future - I would consider my veto invalid (and would anyway withdraw it).
> having heard of the proximity of the beta
Perhaps we can also directly address people's thoughts on deferral to 4.0-beta? This should surely alleviate concerns around delaying 4.0? I do understand the imperative to get 4.0 out the door, but I also know we all want to ship the best product we can as well. If we can achieve both, we should. APIs matter, and avoiding API churn is an important part of our user/operator story.
* I _hope_ we can avoid an epistemic battle about the word "technical," and accept that API design is a technical endeavour to convey meaning.

> Standardise config and JVM parameters
> --------------------------------------
>
> Key: CASSANDRA-15234
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15234
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Config
> Reporter: Benedict Elliott Smith
> Assignee: Ekaterina Dimitrova
> Priority: Normal
> Fix For: 4.0-alpha
>
> Attachments: CASSANDRA-15234-3-DTests-JAVA8.txt
>
> We have a bunch of inconsistent names and config patterns in the codebase, both from the yamls and JVM properties. It would be nice to standardise the naming (such as otc_ vs internode_) as well as the provision of values with units - while maintaining perpetual backwards compatibility with the old parameter names, of course.
> For temporal units, I would propose parsing strings with suffixes of:
> {code}
> u|micros(econds?)?
> ms|millis(econds?)?
> s(econds?)?
> m(inutes?)?
> h(ours?)?
> d(ays?)?
> mo(nths?)?
> {code}
> For rate units, I would propose parsing any of the standard {{B/s, KiB/s, MiB/s, GiB/s, TiB/s}}.
> Perhaps for avoiding ambiguity we could not accept bit-based rates ({{b/s}}, {{Mbps}}) or powers of 1000 such as {{KB/s}}, given these are regularly used for either their old or new definition e.g. {{KiB/s}}, or we could support them and simply log the value in bytes/s.
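As a rough illustration of the proposed temporal-unit parsing, a hypothetical helper might look like the following sketch. The suffix alternatives mirror the list in the description above (months are omitted, since a month has no fixed millisecond length); the class name, the method name, and the choice to normalise to milliseconds are all assumptions made for the example.

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DurationParser
{
    // Accepts e.g. "10ms", "3 hours", "2d"; case-sensitive for brevity.
    private static final Pattern DURATION = Pattern.compile(
        "(\\d+)\\s*(u|micros(?:econds?)?|ms|millis(?:econds?)?" +
        "|s(?:econds?)?|m(?:inutes?)?|h(?:ours?)?|d(?:ays?)?)");

    static long parseMillis(String value)
    {
        Matcher m = DURATION.matcher(value.trim());
        if (!m.matches())
            throw new IllegalArgumentException("invalid duration: " + value);

        long amount = Long.parseLong(m.group(1));
        String unit = m.group(2);

        // Order matters: micro/milli must be checked before minutes.
        if (unit.equals("u") || unit.startsWith("micro"))
            return TimeUnit.MICROSECONDS.toMillis(amount); // truncates below 1ms
        if (unit.equals("ms") || unit.startsWith("milli"))
            return amount;
        if (unit.startsWith("s"))
            return TimeUnit.SECONDS.toMillis(amount);
        if (unit.startsWith("m"))
            return TimeUnit.MINUTES.toMillis(amount);
        if (unit.startsWith("h"))
            return TimeUnit.HOURS.toMillis(amount);
        if (unit.startsWith("d"))
            return TimeUnit.DAYS.toMillis(amount);
        throw new IllegalArgumentException("unknown unit: " + unit);
    }
}
{code}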
[jira] [Comment Edited] (CASSANDRA-15234) Standardise config and JVM parameters
[ https://issues.apache.org/jira/browse/CASSANDRA-15234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151948#comment-17151948 ] Benedict Elliott Smith edited comment on CASSANDRA-15234 at 7/6/20, 10:38 AM:

bq. The valid concern around the api churn it introduces is addressed by committing the new ticket to 4.0-beta.
This ticket's API replaces the current API, is mutually exclusive with the alternative proposal, and would be deprecated by it. If we introduce them both in 4.0-beta, we must maintain them both and go through the full deprecation process. So unfortunately no churn is avoided.
> I'd be curious what your perspective is on how we determine what qualifies as justified
It is customary that, before work is committed, alternative proposals are engaged with on their technical merits. It occurs to me we recently worked to mandate this as part of the process, in fact. In this case we _seem_ in danger of subordinating this to beliefs about scheduling.
If you like, I can formulate a legally airtight veto, but my goal is only for you to engage briefly with the proposal and determine for yourselves which is superior. If the new proposal is _technically_* superior, and of similar complexity, then you are my justification. If you disagree, however - and importantly we agree that we do not intend to pursue the alternative approach in future - I would consider my veto invalid (and would anyway withdraw it).
> having heard of the proximity of the beta
Perhaps we can also directly address people's thoughts on deferral to 4.0-beta? This should surely alleviate concerns around delaying 4.0? I do understand the imperative to get 4.0 out the door, but I also know we all want to ship the best product we can as well. If we can achieve both, we should. APIs matter, and avoiding API churn is an important part of our user/operator story.
\* I _hope_ we can avoid an epistemic battle about the word "technical," and accept that API design is a technical endeavour to convey meaning.
[jira] [Created] (CASSANDRA-15923) Collection types written via prepared statement not checked for nulls
Tom van der Woerdt created CASSANDRA-15923: Summary: Collection types written via prepared statement not checked for nulls Key: CASSANDRA-15923 URL: https://issues.apache.org/jira/browse/CASSANDRA-15923 Project: Cassandra Issue Type: Bug Components: Messaging/Client Reporter: Tom van der Woerdt

To reproduce:

{code:java}
>>> from cassandra.cluster import Cluster
>>> cluster = Cluster()
>>> session = cluster.connect()
>>> session.execute("create keyspace frozen_int_test with replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")
>>> session.execute("create table frozen_int_test.mytable (id int primary key, value frozen<list<int>>)")
>>> session.execute(session.prepare("insert into frozen_int_test.mytable (id, value) values (?, ?)"), (1, [1,2,3]))
>>> list(session.execute("select * from frozen_int_test.mytable"))
[Row(id=1, value=[1, 2, 3])]
>>> session.execute(session.prepare("insert into frozen_int_test.mytable (id, value) values (?, ?)"), (1, [1,2,None]))
>>> list(session.execute("select * from frozen_int_test.mytable"))
[Row(id=1, value=[1, 2, None])]
{code}

Now you might say "But Tom, that just shows that it works!", but this does not work as a CQL literal:

{code:java}
>>> session.execute("insert into frozen_int_test.mytable (id, value) values (1, [1,2,null])")
[...]
cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="null is not supported inside collections"
{code}

Worse, if a mutation like this makes its way into the hints, it will be retried indefinitely as it fails validation with a NullPointerException:

{code:java}
ERROR [MutationStage-11] 2020-07-06 09:23:25,696 AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread Thread[MutationStage-11,5,main]
java.lang.NullPointerException: null
at org.apache.cassandra.serializers.Int32Serializer.validate(Int32Serializer.java:41) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.serializers.ListSerializer.validateForNativeProtocol(ListSerializer.java:70) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.serializers.CollectionSerializer.validate(CollectionSerializer.java:56) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.db.marshal.AbstractType.validate(AbstractType.java:162) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.db.marshal.AbstractType.validateCellValue(AbstractType.java:196) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.db.marshal.CollectionType.validateCellValue(CollectionType.java:124) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.config.ColumnDefinition.validateCell(ColumnDefinition.java:410) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.db.rows.AbstractCell.validate(AbstractCell.java:154) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.db.partitions.PartitionUpdate.validate(PartitionUpdate.java:486) ~[apache-cassandra-3.11.6.jar:3.11.6]
at java.util.Collections$SingletonSet.forEach(Collections.java:4769) ~[na:1.8.0_252]
at org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:69) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) ~[apache-cassandra-3.11.6.jar:3.11.6]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_252]
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:165) ~[apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:137) [apache-cassandra-3.11.6.jar:3.11.6]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:113) [apache-cassandra-3.11.6.jar:3.11.6]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_252]
{code}

A similar problem is reproducible when writing into a non-frozen column.
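For context, the missing server-side check is conceptually small. A hypothetical sketch (invented names, not the actual fix) of rejecting nulls in bound collection values at request time, mirroring the "null is not supported inside collections" error that CQL literals already receive:

{code:java}
import java.nio.ByteBuffer;
import java.util.List;

final class BoundCollectionValidation
{
    /**
     * Each element of a bound list arrives as a serialized ByteBuffer; a null
     * buffer is a null element. Rejecting it here would fail the write up
     * front, instead of producing a mutation that later fails hint replay
     * with a NullPointerException in Int32Serializer.validate(..).
     */
    static void rejectNullElements(List<ByteBuffer> serializedElements)
    {
        for (ByteBuffer element : serializedElements)
        {
            if (element == null)
                throw new IllegalArgumentException("null is not supported inside collections");
        }
    }
}
{code}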
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Status: In Progress (was: Patch Available)
[jira] [Updated] (CASSANDRA-10968) When taking snapshot, manifest.json contains incorrect or no files when column family has secondary indexes
[ https://issues.apache.org/jira/browse/CASSANDRA-10968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andres de la Peña updated CASSANDRA-10968: --- Reviewers: Andres de la Peña

> When taking snapshot, manifest.json contains incorrect or no files when column family has secondary indexes
> -------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-10968
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10968
> Project: Cassandra
> Issue Type: Bug
> Components: Feature/2i Index
> Reporter: Fred A
> Assignee: Aleksandr Sorokoumov
> Priority: Normal
> Labels: lhf
> Fix For: 2.2.x, 3.0.x, 3.11.x
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> Noticed indeterminate behaviour when taking snapshots on column families that have secondary indexes set up. The manifest.json created when taking the snapshot sometimes contains no file names at all and sometimes only some file names.
> I don't know if this post is related, but it was the only thing I could find:
> http://www.mail-archive.com/user%40cassandra.apache.org/msg42019.html
[jira] [Commented] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151940#comment-17151940 ] Benedict Elliott Smith commented on CASSANDRA-15922: -
There is perhaps a better alternative: use {{addAndGet->if}} instead of {{read->if->compareAndSet}}, i.e. unconditionally update the pointer, then determine in the aftermath whether or not you successfully allocated. This is guaranteed to succeed in one step; contention can slow that step down modestly, but there is no wasted competition. There is no downside to this approach with the {{NativeAllocator}}, either, since if we fail to allocate we always swap the {{Region}}, so consuming more than we need when smaller allocations may have been possible is not a problem. So we should have made this change a long time ago, really.
It _might_ be that this approach still sees some slowdown: I assume in this case one of the problems is that we are allocating huge numbers of small objects, so that a small number of threads are competing over and over again to allocate the same data. We should not be competing for each {{Cell}} allocation, and should instead try to allocate all the buffers for e.g. at least a {{Row}} at once. But this is more involved. Ideally we would improve the allocator itself, which is very under-engineered, but with our threading model that's more challenging than we might like.
The _upside_ to this approach is that ordinary workloads should be _improved_, and there is no possibility of thread starvation. The current proposal by contrast introduces much longer windows for thread starvation, and _might_ negatively impact tail latencies. This is a very difficult thing for us to rule out, so the work required to demonstrate it is performance-neutral could be prohibitive.
> High CAS failures in NativeAllocator.Region.allocate(..)
> ---------------------------------------------------------
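A minimal sketch of the {{addAndGet->if}} alternative suggested in the comment above; illustrative only, with the {{Region}} shape assumed from the ticket, not a committed patch:
{code}
import java.util.concurrent.atomic.AtomicInteger;

final class Region
{
    private final int capacity; // assumed field
    private final AtomicInteger nextFreeOffset = new AtomicInteger(0);

    Region(int capacity) { this.capacity = capacity; }

    /**
     * Unconditionally bump the pointer, then check in the aftermath
     * whether the allocation actually fit. Completes in one atomic step:
     * no retry loop, so no wasted CAS competition.
     */
    int allocate(int size)
    {
        int end = nextFreeOffset.addAndGet(size);
        if (end > capacity)
            return -1; // did not fit; the caller swaps in a new Region,
                       // so the over-consumed tail is simply abandoned
        return end - size; // offset where this allocation begins
    }
}
{code}
The design trade-off named in the comment is visible here: a losing thread still consumes space it will never use, which is only acceptable because a full {{Region}} is always discarded and replaced.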
[jira] [Comment Edited] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151869#comment-17151869 ] Michael Semb Wever edited comment on CASSANDRA-15922 at 7/6/20, 9:43 AM: -
bq. Although the same change probably needs to be applied to org.apache.cassandra.utils.memory.SlabAllocator.Region#allocate as well.
Added to patch.
bq. there's a slight issue in the attached NativeAllocatorRegionTest.java Region.allocate() method that adds another CAS (casFailures) to every failed CAS against nextFreeOffset. It's probably better to count the number of failed CASes in a local variable and add it to this.casFailures when the test's Region.allocate() returns.
Fixed and re-running tests. Thanks [~snazy].
EDIT: new screenshots uploaded. Results and conclusions stay the same.
> High CAS failures in NativeAllocator.Region.allocate(..)
> ---------------------------------------------------------
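The test-side fix described in the comment above, sketched under the same assumed {{Region}} shape: failures are tallied in a local variable so the instrumentation itself does not add a second contended atomic to every retry.
{code}
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

final class Region
{
    private final int capacity; // assumed field
    private final AtomicInteger nextFreeOffset = new AtomicInteger(0);
    // instrumentation: total CAS failures over the Region's life
    private final AtomicLong casFailures = new AtomicLong();

    Region(int capacity) { this.capacity = capacity; }

    int allocate(int size)
    {
        int failures = 0; // local counter: no extra contention per retry
        try
        {
            while (true)
            {
                int oldOffset = nextFreeOffset.get();
                if (oldOffset + size > capacity)
                    return -1;
                if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size))
                    return oldOffset;
                ++failures;
                LockSupport.parkNanos(1);
            }
        }
        finally
        {
            // added once, when allocate() returns
            casFailures.addAndGet(failures);
        }
    }
}
{code}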
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Attachment: Screen Shot 2020-07-06 at 11.35.35.png
Screen Shot 2020-07-06 at 11.36.44.png
> High CAS failures in NativeAllocator.Region.allocate(..)
> ---------------------------------------------------------
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Attachment: NativeAllocatorRegionTest.java
> High CAS failures in NativeAllocator.Region.allocate(..)
> ---------------------------------------------------------
[jira] [Updated] (CASSANDRA-15922) High CAS failures in NativeAllocator.Region.allocate(..)
[ https://issues.apache.org/jira/browse/CASSANDRA-15922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Semb Wever updated CASSANDRA-15922: --- Attachment: (was: NativeAllocatorRegionTest.java)
> High CAS failures in NativeAllocator.Region.allocate(..)
> ---------------------------------------------------------
[jira] [Commented] (CASSANDRA-15901) Fix unit tests to load test/conf/cassandra.yaml (so to listen on a valid ip)
[ https://issues.apache.org/jira/browse/CASSANDRA-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151899#comment-17151899 ] Michael Semb Wever commented on CASSANDRA-15901: -
Just the unit tests (on cassandra13) at https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch-test/165
Full devbranch pipeline (now that we're touching runtime code) at https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/196/
> Fix unit tests to load test/conf/cassandra.yaml (so to listen on a valid ip)
> -----------------------------------------------------------------------------
> Key: CASSANDRA-15901
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15901
> Project: Cassandra
> Issue Type: Bug
> Components: Test/dtest
> Reporter: Berenguer Blasi
> Assignee: Berenguer Blasi
> Priority: Normal
> Fix For: 4.0-rc
>
> Many of the ci-cassandra Jenkins runs fail on {{ip-10-0-5-5: Name or service not known}}. CASSANDRA-15622 addressed some of these, but many still remain. Currently test C* nodes are either failing or listening on a public IP, depending on which agent they end up on.
> The idea behind this ticket is to make ant force the private VPC IP into the cassandra yaml when building; this will force the nodes to listen on the correct IP.
[jira] [Commented] (CASSANDRA-15901) Fix unit tests to load test/conf/cassandra.yaml (so to listen on a valid ip)
[ https://issues.apache.org/jira/browse/CASSANDRA-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151890#comment-17151890 ] Berenguer Blasi commented on CASSANDRA-15901: -
[~snazy] I accepted all your commits except a single one on wording. Please take a look at it, and at the {{JMXAuthTest}} failure, which is a hostname resolution error in some library. Wdyt?
[~mck] could you be so kind as to fire a run against cassandra13, please :-)? I am running a full CI on Circle as well. If all goes well, that should be it.
> Fix unit tests to load test/conf/cassandra.yaml (so to listen on a valid ip)
> -----------------------------------------------------------------------------
[jira] [Commented] (CASSANDRA-15901) Fix unit tests to load test/conf/cassandra.yaml (so to listen on a valid ip)
[ https://issues.apache.org/jira/browse/CASSANDRA-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151874#comment-17151874 ] Berenguer Blasi commented on CASSANDRA-15901: -
So it seems, after all the back and forth, and given the current restrictions, mainly:
* not providing a synthetic address load, which the test doesn't actually need
* avoiding failing on misconfigured nodes
adding a third fallback seems reasonable enough, moving from failing scenarios to a 'localhost' listen in a major release. Both {{getLocalHost()}} and {{getLoopbackAddress()}} may fail under some OS/IP configurations, so in any case we're no worse off than we used to be.
The change has been pushed. It's undergoing review, and an initial test run has been [fired|https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch-test/161/]. IIRC at least the JMXAuth test is a legit failure I need to look into. Once that's done I'll run a full test suite, as this is touching C* code: dtests, unit tests, jvm, etc.
> Fix unit tests to load test/conf/cassandra.yaml (so to listen on a valid ip)
> -----------------------------------------------------------------------------
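A rough sketch of the fallback chain discussed in the comment above. The helper class, method name, and the configured-address parameter are hypothetical, introduced only to illustrate the ordering (configured address, then {{getLocalHost()}}, then loopback as the third fallback); this is not the actual patch:
{code}
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper illustrating the discussed fallback chain.
final class ListenAddressResolver
{
    static InetAddress resolve(String configuredAddress)
    {
        // 1. address from test/conf/cassandra.yaml, if resolvable
        if (configuredAddress != null)
        {
            try
            {
                return InetAddress.getByName(configuredAddress);
            }
            catch (UnknownHostException e)
            {
                // fall through to the next option
            }
        }
        // 2. the host's own name; may fail under some OS/IP configurations
        try
        {
            return InetAddress.getLocalHost();
        }
        catch (UnknownHostException e)
        {
            // 3. third fallback: listen on localhost
            return InetAddress.getLoopbackAddress();
        }
    }
}
{code}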
[jira] [Commented] (CASSANDRA-15406) Add command to show the progress of data streaming and index build
[ https://issues.apache.org/jira/browse/CASSANDRA-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151873#comment-17151873 ] Benjamin Lerer commented on CASSANDRA-15406: -
[~stefan.miklosovic] Thanks. New Jenkins [run|https://ci-cassandra.apache.org/job/Cassandra-devbranch/195/]
> Add command to show the progress of data streaming and index build
> -------------------------------------------------------------------
> Key: CASSANDRA-15406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15406
> Project: Cassandra
> Issue Type: Improvement
> Components: Consistency/Streaming, Legacy/Streaming and Messaging, Tool/nodetool
> Reporter: maxwellguo
> Assignee: Stefan Miklosovic
> Priority: Normal
> Fix For: 4.0, 4.x
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We should supply a command to show the progress of streaming during bootstrap/move/decommission/removenode operations. While data streaming is underway, nobody knows which step the program is in, so a command to show the joining/leaving node's progress is needed.
>
> PR [https://github.com/apache/cassandra/pull/558]