[jira] [Commented] (CASSANDRA-19001) Check whether the startup warnings for unknown modules represent a legit problem or cosmetic issue

2023-11-30 Thread Berenguer Blasi (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791933#comment-17791933
 ] 

Berenguer Blasi commented on CASSANDRA-19001:
-

I've been looking at the patch and I don't see how those TTLs can come about. I 
checked the dtest code and it all looks good to me. The dtest even passes 
locally with both the 5.0 code and this patch's code. So unless that CI run was 
broken, or there was some merge/rebase race or something esoteric along those 
lines, I don't see anything. I would do a new run...

> Check whether the startup warnings for unknown modules represent a legit 
> problem or cosmetic issue
> --
>
> Key: CASSANDRA-19001
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19001
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Other
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 5.0-rc, 5.0.x, 5.x
>
>
> During the 5.0 alpha 2 release 
> [vote|https://lists.apache.org/thread/lt3x0obr5cpbcydf5490pj6b2q0mz5zr], 
> [~paulo] raised the following concerns:
> {code:java}
> Launched a tarball-based 5.0-alpha2 container on top of
> "eclipse-temurin:17-jre-focal" and the server starts up fine, can run
> nodetool and cqlsh.
> I got these seemingly harmless JDK17 warnings during startup and when
> running nodetool (no warnings on JDK11):
> WARNING: Unknown module: jdk.attach specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-opens
> WARNING: A terminally deprecated method in java.lang.System has been called
> WARNING: System::setSecurityManager has been called by
> org.apache.cassandra.security.ThreadAwareSecurityManager
> (file:/opt/cassandra/lib/apache-cassandra-5.0-alpha2-SNAPSHOT.jar)
> WARNING: Please consider reporting this to the maintainers of
> org.apache.cassandra.security.ThreadAwareSecurityManager
> WARNING: System::setSecurityManager will be removed in a future release
> Anybody knows if these warnings are legit/expected ? We can create
> follow-up tickets if needed.
> $ java --version
> openjdk 17.0.9 2023-10-17
> OpenJDK Runtime Environment Temurin-17.0.9+9 (build 17.0.9+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.9+9 (build 17.0.9+9, mixed mode,
> sharing)
> {code}
> {code:java}
> Clarification: - When running nodetool only the "Unknown module" warnings 
> show up. All warnings show up during startup.{code}
> We need to verify whether this presents a real problem in the features where 
> those modules are expected to be used, or if it is a false alarm. 
>  
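One quick way to tell whether these warnings are purely cosmetic is to check at runtime which of the modules named in the warnings actually exist in the boot layer of the JRE being used. A minimal sketch (not part of any patch; module names taken from the warnings above):

{code:java}
import java.util.List;

public class ModuleCheck
{
    public static void main(String[] args)
    {
        // Modules referenced by --add-exports/--add-opens in the startup warnings above.
        for (String name : List.of("jdk.attach", "jdk.compiler"))
        {
            boolean present = ModuleLayer.boot().findModule(name).isPresent();
            // On a JRE-only image (e.g. eclipse-temurin:17-jre) these modules are typically
            // absent, so the JVM ignores the flag and prints "Unknown module" instead of failing.
            System.out.println(name + " present: " + present);
        }
    }
}
{code}

If the modules are absent in the JRE image, the flags are simply ignored, which would make the "Unknown module" warnings cosmetic unless some feature actually needs those modules at runtime.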



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19123) Test failure: ConsistentMoveTest#moveTest-cassandra.testtag_IS_UNDEFINED

2023-11-30 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19123:

Attachment: ci_summary.html
result_details.tar.gz

> Test failure: ConsistentMoveTest#moveTest-cassandra.testtag_IS_UNDEFINED
> 
>
> Key: CASSANDRA-19123
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19123
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest/java
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary.html, result_details.tar.gz
>
>
> {code}
> Timeout occurred. Please note the time in the report does not reflect the 
> time until the timeout. 
> junit.framework.AssertionFailedError: Timeout occurred. Please note the time 
> in the report does not reflect the time until the timeout. 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) 
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) 
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at org.apache.cassandra.anttasks.TestHelper.execute(TestHelper.java:53) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19123) Test failure: ConsistentMoveTest#moveTest-cassandra.testtag_IS_UNDEFINED

2023-11-30 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19123:

Reviewers: Sam Tunnicliffe, Alex Petrov  (was: Alex Petrov, Sam Tunnicliffe)
   Status: Review In Progress  (was: Patch Available)

> Test failure: ConsistentMoveTest#moveTest-cassandra.testtag_IS_UNDEFINED
> 
>
> Key: CASSANDRA-19123
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19123
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest/java
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
>
> {code}
> Timeout occurred. Please note the time in the report does not reflect the 
> time until the timeout. 
> junit.framework.AssertionFailedError: Timeout occurred. Please note the time 
> in the report does not reflect the time until the timeout. 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) 
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) 
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at org.apache.cassandra.anttasks.TestHelper.execute(TestHelper.java:53) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19123) Test failure: ConsistentMoveTest#moveTest-cassandra.testtag_IS_UNDEFINED

2023-11-30 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19123:

 Bug Category: Parent values: Code(13163), Level 1 values: Bug - Unclear Impact(13164)
   Complexity: Normal
  Component/s: Test/dtest/java
Discovered By: DTest
     Severity: Normal
     Assignee: Alex Petrov
       Status: Open  (was: Triage Needed)

> Test failure: ConsistentMoveTest#moveTest-cassandra.testtag_IS_UNDEFINED
> 
>
> Key: CASSANDRA-19123
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19123
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest/java
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
>
> {code}
> Timeout occurred. Please note the time in the report does not reflect the 
> time until the timeout. 
> junit.framework.AssertionFailedError: Timeout occurred. Please note the time 
> in the report does not reflect the time until the timeout. 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) 
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) 
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at org.apache.cassandra.anttasks.TestHelper.execute(TestHelper.java:53) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19123) Test failure: ConsistentMoveTest#moveTest-cassandra.testtag_IS_UNDEFINED

2023-11-30 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19123:

Test and Documentation Plan: Original Test Suite
 Status: Patch Available  (was: Open)

> Test failure: ConsistentMoveTest#moveTest-cassandra.testtag_IS_UNDEFINED
> 
>
> Key: CASSANDRA-19123
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19123
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest/java
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
>
> {code}
> Timeout occurred. Please note the time in the report does not reflect the 
> time until the timeout. 
> junit.framework.AssertionFailedError: Timeout occurred. Please note the time 
> in the report does not reflect the time until the timeout. 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) 
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) 
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at org.apache.cassandra.anttasks.TestHelper.execute(TestHelper.java:53) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19123) Test failure: ConsistentMoveTest#moveTest-cassandra.testtag_IS_UNDEFINED

2023-11-30 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19123:

Status: Open  (was: Patch Available)

> Test failure: ConsistentMoveTest#moveTest-cassandra.testtag_IS_UNDEFINED
> 
>
> Key: CASSANDRA-19123
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19123
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest/java
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
>
> {code}
> Timeout occurred. Please note the time in the report does not reflect the 
> time until the timeout. 
> junit.framework.AssertionFailedError: Timeout occurred. Please note the time 
> in the report does not reflect the time until the timeout. 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) 
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) 
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at org.apache.cassandra.anttasks.TestHelper.execute(TestHelper.java:53) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19123) Test failure: ConsistentMoveTest#moveTest-cassandra.testtag_IS_UNDEFINED

2023-11-30 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19123:

Status: Patch Available  (was: Open)

> Test failure: ConsistentMoveTest#moveTest-cassandra.testtag_IS_UNDEFINED
> 
>
> Key: CASSANDRA-19123
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19123
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest/java
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
>
> {code}
> Timeout occurred. Please note the time in the report does not reflect the 
> time until the timeout. 
> junit.framework.AssertionFailedError: Timeout occurred. Please note the time 
> in the report does not reflect the time until the timeout. 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) 
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) 
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at org.apache.cassandra.anttasks.TestHelper.execute(TestHelper.java:53) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.base/java.util.Vector.forEach(Vector.java:1394) 
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at jdk.internal.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) 
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19083) Remove dependency on bundled Harry jar

2023-11-30 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19083:

Test and Documentation Plan: Original test suite 
 Status: Patch Available  (was: Open)

> Remove dependency on bundled Harry jar
> --
>
> Key: CASSANDRA-19083
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19083
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/unit
>Reporter: Sam Tunnicliffe
>Assignee: Alex Petrov
>Priority: Urgent
> Fix For: 5.1-alpha1
>
>
> For expediency, we temporarily added a snapshot jar to the source tree, 
> {{lib/harry-core-0.0.2-CASSANDRA-18768.jar}}. We should remove this as soon 
> as the next Harry release is published.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19083) Remove dependency on bundled Harry jar

2023-11-30 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19083:

Reviewers: Marcus Eriksson, Sam Tunnicliffe
   Status: Review In Progress  (was: Patch Available)

> Remove dependency on bundled Harry jar
> --
>
> Key: CASSANDRA-19083
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19083
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/unit
>Reporter: Sam Tunnicliffe
>Assignee: Alex Petrov
>Priority: Urgent
> Fix For: 5.1-alpha1
>
>
> For expediency, we temporarily added a snapshot jar to the source tree, 
> {{lib/harry-core-0.0.2-CASSANDRA-18768.jar}}. We should remove this as soon 
> as the next Harry release is published.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-13560) Improved cleanup performance

2023-11-30 Thread Brian Gallew (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Gallew updated CASSANDRA-13560:
-
Resolution: Abandoned
Status: Resolved  (was: Open)

> Improved cleanup performance
> 
>
> Key: CASSANDRA-13560
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13560
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Compaction
>Reporter: Brian Gallew
>Priority: Normal
>
> I've been thinking about sstables.  One of their properties is that they are 
> sorted.  In the face of that property, it would seem that the cleanup 
> functionality *should* be very fast as all of the partitions which no longer 
> belong to a given node should be in either one or two contiguous blocks of 
> space.  Perhaps this is naive, but I would think the index should clearly 
> indicate what needs to be retained versus what can be disposed of, and thus a 
> cleanup should be able to start reading with the first valid partition, stop 
> with the last, and skip the bulk of loading/unloading that seems to be 
> happening.
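A rough sketch of the idea in the description, assuming partitions in an sstable are ordered by token and the node owns a single contiguous (non-wrapping) token range. This is a hypothetical helper to illustrate the argument, not Cassandra's actual cleanup code:

{code:java}
import java.util.Collections;
import java.util.List;

public class CleanupSketch
{
    /**
     * Given the sorted partition tokens of an sstable and the owned range
     * [rangeStart, rangeEnd], return the index range [from, to) of partitions to
     * retain. Everything before 'from' and after 'to' can be skipped as two
     * contiguous blocks, without reading the partitions themselves.
     */
    static int[] retainedSlice(List<Long> sortedTokens, long rangeStart, long rangeEnd)
    {
        int from = lowerBound(sortedTokens, rangeStart);
        int to = lowerBound(sortedTokens, rangeEnd + 1);
        return new int[] { from, to };
    }

    // First index whose token is >= key (classic lower bound via binary search).
    static int lowerBound(List<Long> tokens, long key)
    {
        int idx = Collections.binarySearch(tokens, key);
        return idx >= 0 ? idx : -idx - 1;
    }

    public static void main(String[] args)
    {
        List<Long> tokens = List.of(-80L, -10L, 5L, 42L, 99L);
        int[] slice = retainedSlice(tokens, -10L, 42L);
        // Prints "retain indexes [1, 4)": only -10, 5 and 42 need to be copied.
        System.out.println("retain indexes [" + slice[0] + ", " + slice[1] + ")");
    }
}
{code}

Wrapping ranges, multiple owned ranges and index granularity are what make the real implementation more involved than this sketch suggests.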



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19126) Streaming appears to be incompatible with different storage_compatibility_mode settings

2023-11-30 Thread Berenguer Blasi (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791914#comment-17791914
 ] 

Berenguer Blasi edited comment on CASSANDRA-19126 at 12/1/23 6:27 AM:
--

Wouldn't nodes in v5.0 stream correctly, as they would communicate with the 
correct maximum supported messaging version between the two? Like in any other 
mixed-version upgrade scenario...

bq. We probably also need storage_compatibility_mode testing somewhere in our 
testing matrix

Branimir is already going to enable that setting in CI in CASSANDRA-18753, if 
that is what you're referring to?


was (Author: bereng):
Wouldn't nodes in v5.0 stream correctly as they would communicate with the 
correct maximum supported messaging version between the 2?

bq. We probably also need storage_compatibility_mode testing somewhere in our 
testing matrix

Branimir is already going to be enabling that setting in CASSANDRA-18753 in CI 
if that is what you're referring to?

> Streaming appears to be incompatible with different 
> storage_compatibility_mode settings
> ---
>
> Key: CASSANDRA-19126
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19126
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Streaming, Legacy/Streaming and Messaging, 
> Messaging/Internode, Tool/bulk load
>Reporter: Branimir Lambov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
>
> In particular, SSTableLoader appears to be incompatible with 
> storage_compatibility_mode: NONE, which manifests as a failure of 
> {{org.apache.cassandra.distributed.test.SSTableLoaderEncryptionOptionsTest}} 
> when the flag is turned on (found during CASSANDRA-18753 testing). Setting 
> {{storage_compatibility_mode: NONE}} in the tool configuration yaml does not 
> help (according to the docs, this setting is not picked up).
> This is likely a bigger problem as the acceptable streaming version for C* 5 
> is 12 only in legacy mode and 13 only in none, i.e. two C* 5 nodes do not 
> appear to be able to stream with each other if their setting for the 
> compatibility mode is different.
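To illustrate why the disjoint version sets matter, here is a toy sketch of version negotiation, assuming each node advertises a [min, max] acceptable streaming version and the pair settles on min(maxA, maxB) if that is still acceptable to both. The names and structure are illustrative only, not the actual MessagingService code; the versions 12 and 13 are the ones quoted in the description:

{code:java}
public class StreamingVersionSketch
{
    record Node(String name, int minVersion, int maxVersion) {}

    // Returns the negotiated version, or -1 if the two nodes share no acceptable version.
    static int negotiate(Node a, Node b)
    {
        int candidate = Math.min(a.maxVersion(), b.maxVersion());
        return candidate >= Math.max(a.minVersion(), b.minVersion()) ? candidate : -1;
    }

    public static void main(String[] args)
    {
        // Hypothetical: the legacy-compatibility node accepts only 12, the NONE node only 13.
        Node legacy = new Node("node1 (storage_compatibility_mode: legacy)", 12, 12);
        Node none = new Node("node2 (storage_compatibility_mode: NONE)", 13, 13);
        // Prints -1: there is no common version, so streaming between the two cannot proceed.
        System.out.println("negotiated: " + negotiate(legacy, none));
    }
}
{code}

If each mode accepts exactly one, different, version, negotiating "the maximum supported version between the two" has nothing to land on, which is consistent with the failure described above.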



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19126) Streaming appears to be incompatible with different storage_compatibility_mode settings

2023-11-30 Thread Berenguer Blasi (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791914#comment-17791914
 ] 

Berenguer Blasi commented on CASSANDRA-19126:
-

Wouldn't nodes in v5.0 stream correctly as they would communicate with the 
correct maximum supported messaging version between the 2?

bq. We probably also need storage_compatibility_mode testing somewhere in our 
testing matrix

Branimir is already going to be enabling that setting in CASSANDRA-18753 in CI 
if that is what you're referring to?

> Streaming appears to be incompatible with different 
> storage_compatibility_mode settings
> ---
>
> Key: CASSANDRA-19126
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19126
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Streaming, Legacy/Streaming and Messaging, 
> Messaging/Internode, Tool/bulk load
>Reporter: Branimir Lambov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
>
> In particular, SSTableLoader appears to be incompatible with 
> storage_compatibility_mode: NONE, which manifests as a failure of 
> {{org.apache.cassandra.distributed.test.SSTableLoaderEncryptionOptionsTest}} 
> when the flag is turned on (found during CASSANDRA-18753 testing). Setting 
> {{storage_compatibility_mode: NONE}} in the tool configuration yaml does not 
> help (according to the docs, this setting is not picked up).
> This is likely a bigger problem as the acceptable streaming version for C* 5 
> is 12 only in legacy mode and 13 only in none, i.e. two C* 5 nodes do not 
> appear to be able to stream with each other if their setting for the 
> compatibility mode is different.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-18947) Test failure: dtest-novnode.disk_balance_test.TestDiskBalance.test_disk_balance_stress

2023-11-30 Thread Berenguer Blasi (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791912#comment-17791912
 ] 

Berenguer Blasi commented on CASSANDRA-18947:
-

bq. it was a matter of clean code

Yep, correct, that's important. But as it is now there's only an extra tiny 
loop, which hardly affects readability imo, while it gives you better 
maintainability. So... pick your poison.

Thanks for all the reviews.

> Test failure: 
> dtest-novnode.disk_balance_test.TestDiskBalance.test_disk_balance_stress
> --
>
> Key: CASSANDRA-18947
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18947
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest/python
>Reporter: Ekaterina Dimitrova
>Assignee: Berenguer Blasi
>Priority: Normal
> Fix For: 5.0-rc
>
>
> Seen here:
> https://ci-cassandra.apache.org/job/Cassandra-5.0/72/testReport/dtest-novnode.disk_balance_test/TestDiskBalance/test_disk_balance_stress/
> h3.  
> {code:java}
> Error Message
> AssertionError: values not within 10.00% of the max: (2534183, 2762123, 
> 2423706) (node1)
> Stacktrace
> self =  def 
> test_disk_balance_stress(self): cluster = self.cluster if 
> self.dtest_config.use_vnodes: 
> cluster.set_configuration_options(values={'num_tokens': 256}) 
> cluster.populate(4).start() node1 = cluster.nodes['node1'] 
> node1.stress(['write', 'n=50k', 'no-warmup', '-rate', 'threads=100', 
> '-schema', 'replication(factor=3)', 
> 'compaction(strategy=SizeTieredCompactionStrategy,enabled=false)']) 
> cluster.flush() # make sure the data directories are balanced: for node in 
> cluster.nodelist(): > self.assert_balanced(node) disk_balance_test.py:48: _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> disk_balance_test.py:186: in assert_balanced assert_almost_equal(*new_sums, 
> error=0.1, error_message=node.name) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ args = (2534183, 2762123, 2423706) 
> kwargs = {'error': 0.1, 'error_message': 'node1'}, error = 0.1, vmax = 
> 2762123 vmin = 2423706, error_message = 'node1' def 
> assert_almost_equal(*args, **kwargs): """ Assert variable number of arguments 
> all fall within a margin of error. @params *args variable number of numerical 
> arguments to check @params error Optional margin of error. Default 0.16 
> @params error_message Optional error message to print. Default '' Examples: 
> assert_almost_equal(sizes[2], init_size) assert_almost_equal(ttl_session1, 
> ttl_session2[0][0], error=0.005) """ error = kwargs['error'] if 'error' in 
> kwargs else 0.16 vmax = max(args) vmin = min(args) error_message = '' if 
> 'error_message' not in kwargs else kwargs['error_message'] assert vmin > vmax 
> * (1.0 - error) or vmin == vmax, \ > "values not within {:.2f}% of the max: 
> {} ({})".format(error * 100, args, error_message) E AssertionError: values 
> not within 10.00% of the max: (2534183, 2762123, 2423706) (node1) 
> tools/assertions.py:206: AssertionError
> {code}
>  
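For reference, the assertion that fails above boils down to: every measured value must be within {{error}} of the maximum. A hedged Java equivalent of the dtest's assert_almost_equal (not the dtest code itself), fed with the sample from the report:

{code:java}
import java.util.Arrays;

public class AlmostEqualSketch
{
    // Mirrors the dtest assertion: every value must be > max * (1 - error), or all values equal.
    static void assertAlmostEqual(double error, String message, long... values)
    {
        long vmax = Arrays.stream(values).max().orElseThrow();
        long vmin = Arrays.stream(values).min().orElseThrow();
        if (!(vmin > vmax * (1.0 - error) || vmin == vmax))
            throw new AssertionError(String.format("values not within %.2f%% of the max: %s (%s)",
                                                   error * 100, Arrays.toString(values), message));
    }

    public static void main(String[] args)
    {
        // Reproduces the reported failure and throws: 2423706 < 2762123 * 0.9.
        assertAlmostEqual(0.1, "node1", 2534183, 2762123, 2423706);
    }
}
{code}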



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-18947) Test failure: dtest-novnode.disk_balance_test.TestDiskBalance.test_disk_balance_stress

2023-11-30 Thread Berenguer Blasi (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Berenguer Blasi updated CASSANDRA-18947:

  Since Version: 4.0
Source Control Link: 
https://github.com/apache/cassandra-dtest/commit/365085bbd76ee717e265598fd83c6f4c39e1f1e6
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

> Test failure: 
> dtest-novnode.disk_balance_test.TestDiskBalance.test_disk_balance_stress
> --
>
> Key: CASSANDRA-18947
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18947
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest/python
>Reporter: Ekaterina Dimitrova
>Assignee: Berenguer Blasi
>Priority: Normal
> Fix For: 5.0-rc
>
>
> Seen here:
> https://ci-cassandra.apache.org/job/Cassandra-5.0/72/testReport/dtest-novnode.disk_balance_test/TestDiskBalance/test_disk_balance_stress/
> h3.  
> {code:java}
> Error Message
> AssertionError: values not within 10.00% of the max: (2534183, 2762123, 
> 2423706) (node1)
> Stacktrace
> self =  def 
> test_disk_balance_stress(self): cluster = self.cluster if 
> self.dtest_config.use_vnodes: 
> cluster.set_configuration_options(values={'num_tokens': 256}) 
> cluster.populate(4).start() node1 = cluster.nodes['node1'] 
> node1.stress(['write', 'n=50k', 'no-warmup', '-rate', 'threads=100', 
> '-schema', 'replication(factor=3)', 
> 'compaction(strategy=SizeTieredCompactionStrategy,enabled=false)']) 
> cluster.flush() # make sure the data directories are balanced: for node in 
> cluster.nodelist(): > self.assert_balanced(node) disk_balance_test.py:48: _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> disk_balance_test.py:186: in assert_balanced assert_almost_equal(*new_sums, 
> error=0.1, error_message=node.name) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ args = (2534183, 2762123, 2423706) 
> kwargs = {'error': 0.1, 'error_message': 'node1'}, error = 0.1, vmax = 
> 2762123 vmin = 2423706, error_message = 'node1' def 
> assert_almost_equal(*args, **kwargs): """ Assert variable number of arguments 
> all fall within a margin of error. @params *args variable number of numerical 
> arguments to check @params error Optional margin of error. Default 0.16 
> @params error_message Optional error message to print. Default '' Examples: 
> assert_almost_equal(sizes[2], init_size) assert_almost_equal(ttl_session1, 
> ttl_session2[0][0], error=0.005) """ error = kwargs['error'] if 'error' in 
> kwargs else 0.16 vmax = max(args) vmin = min(args) error_message = '' if 
> 'error_message' not in kwargs else kwargs['error_message'] assert vmin > vmax 
> * (1.0 - error) or vmin == vmax, \ > "values not within {:.2f}% of the max: 
> {} ({})".format(error * 100, args, error_message) E AssertionError: values 
> not within 10.00% of the max: (2534183, 2762123, 2423706) (node1) 
> tools/assertions.py:206: AssertionError
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



(cassandra-dtest) branch trunk updated: Test failure: dtest-novnode.disk_balance_test.TestDiskBalance.test_disk_balance_stress

2023-11-30 Thread bereng
This is an automated email from the ASF dual-hosted git repository.

bereng pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra-dtest.git


The following commit(s) were added to refs/heads/trunk by this push:
 new 365085bb Test failure: 
dtest-novnode.disk_balance_test.TestDiskBalance.test_disk_balance_stress
365085bb is described below

commit 365085bbd76ee717e265598fd83c6f4c39e1f1e6
Author: Bereng 
AuthorDate: Thu Nov 23 08:35:59 2023 +0100

Test failure: 
dtest-novnode.disk_balance_test.TestDiskBalance.test_disk_balance_stress

Patch by Berenguer Blasi; reviewed by Ekaterina Dimitrova, Michael Semb 
Wever for CASSANDRA-18947
---
 disk_balance_test.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/disk_balance_test.py b/disk_balance_test.py
index ceadf98a..1921f1aa 100644
--- a/disk_balance_test.py
+++ b/disk_balance_test.py
@@ -43,6 +43,7 @@ class TestDiskBalance(Tester):
         node1.stress(['write', 'n=50k', 'no-warmup', '-rate', 'threads=100', '-schema', 'replication(factor=3)',
                       'compaction(strategy=SizeTieredCompactionStrategy,enabled=false)'])
         cluster.flush()
+        cluster.stop()
         # make sure the data directories are balanced:
         for node in cluster.nodelist():
             self.assert_balanced(node)


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-18947) Test failure: dtest-novnode.disk_balance_test.TestDiskBalance.test_disk_balance_stress

2023-11-30 Thread Berenguer Blasi (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Berenguer Blasi updated CASSANDRA-18947:

Reviewers: Ekaterina Dimitrova, Michael Semb Wever  (was: Michael Semb 
Wever)

> Test failure: 
> dtest-novnode.disk_balance_test.TestDiskBalance.test_disk_balance_stress
> --
>
> Key: CASSANDRA-18947
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18947
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest/python
>Reporter: Ekaterina Dimitrova
>Assignee: Berenguer Blasi
>Priority: Normal
> Fix For: 5.0-rc
>
>
> Seen here:
> https://ci-cassandra.apache.org/job/Cassandra-5.0/72/testReport/dtest-novnode.disk_balance_test/TestDiskBalance/test_disk_balance_stress/
> h3.  
> {code:java}
> Error Message
> AssertionError: values not within 10.00% of the max: (2534183, 2762123, 
> 2423706) (node1)
> Stacktrace
> self =  def 
> test_disk_balance_stress(self): cluster = self.cluster if 
> self.dtest_config.use_vnodes: 
> cluster.set_configuration_options(values={'num_tokens': 256}) 
> cluster.populate(4).start() node1 = cluster.nodes['node1'] 
> node1.stress(['write', 'n=50k', 'no-warmup', '-rate', 'threads=100', 
> '-schema', 'replication(factor=3)', 
> 'compaction(strategy=SizeTieredCompactionStrategy,enabled=false)']) 
> cluster.flush() # make sure the data directories are balanced: for node in 
> cluster.nodelist(): > self.assert_balanced(node) disk_balance_test.py:48: _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> disk_balance_test.py:186: in assert_balanced assert_almost_equal(*new_sums, 
> error=0.1, error_message=node.name) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ args = (2534183, 2762123, 2423706) 
> kwargs = {'error': 0.1, 'error_message': 'node1'}, error = 0.1, vmax = 
> 2762123 vmin = 2423706, error_message = 'node1' def 
> assert_almost_equal(*args, **kwargs): """ Assert variable number of arguments 
> all fall within a margin of error. @params *args variable number of numerical 
> arguments to check @params error Optional margin of error. Default 0.16 
> @params error_message Optional error message to print. Default '' Examples: 
> assert_almost_equal(sizes[2], init_size) assert_almost_equal(ttl_session1, 
> ttl_session2[0][0], error=0.005) """ error = kwargs['error'] if 'error' in 
> kwargs else 0.16 vmax = max(args) vmin = min(args) error_message = '' if 
> 'error_message' not in kwargs else kwargs['error_message'] assert vmin > vmax 
> * (1.0 - error) or vmin == vmax, \ > "values not within {:.2f}% of the max: 
> {} ({})".format(error * 100, args, error_message) E AssertionError: values 
> not within 10.00% of the max: (2534183, 2762123, 2423706) (node1) 
> tools/assertions.py:206: AssertionError
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-18947) Test failure: dtest-novnode.disk_balance_test.TestDiskBalance.test_disk_balance_stress

2023-11-30 Thread Berenguer Blasi (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-18947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Berenguer Blasi updated CASSANDRA-18947:

Status: Ready to Commit  (was: Review In Progress)

> Test failure: 
> dtest-novnode.disk_balance_test.TestDiskBalance.test_disk_balance_stress
> --
>
> Key: CASSANDRA-18947
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18947
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest/python
>Reporter: Ekaterina Dimitrova
>Assignee: Berenguer Blasi
>Priority: Normal
> Fix For: 5.0-rc
>
>
> Seen here:
> https://ci-cassandra.apache.org/job/Cassandra-5.0/72/testReport/dtest-novnode.disk_balance_test/TestDiskBalance/test_disk_balance_stress/
> h3.  
> {code:java}
> Error Message
> AssertionError: values not within 10.00% of the max: (2534183, 2762123, 
> 2423706) (node1)
> Stacktrace
> self =  def 
> test_disk_balance_stress(self): cluster = self.cluster if 
> self.dtest_config.use_vnodes: 
> cluster.set_configuration_options(values={'num_tokens': 256}) 
> cluster.populate(4).start() node1 = cluster.nodes['node1'] 
> node1.stress(['write', 'n=50k', 'no-warmup', '-rate', 'threads=100', 
> '-schema', 'replication(factor=3)', 
> 'compaction(strategy=SizeTieredCompactionStrategy,enabled=false)']) 
> cluster.flush() # make sure the data directories are balanced: for node in 
> cluster.nodelist(): > self.assert_balanced(node) disk_balance_test.py:48: _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> disk_balance_test.py:186: in assert_balanced assert_almost_equal(*new_sums, 
> error=0.1, error_message=node.name) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ args = (2534183, 2762123, 2423706) 
> kwargs = {'error': 0.1, 'error_message': 'node1'}, error = 0.1, vmax = 
> 2762123 vmin = 2423706, error_message = 'node1' def 
> assert_almost_equal(*args, **kwargs): """ Assert variable number of arguments 
> all fall within a margin of error. @params *args variable number of numerical 
> arguments to check @params error Optional margin of error. Default 0.16 
> @params error_message Optional error message to print. Default '' Examples: 
> assert_almost_equal(sizes[2], init_size) assert_almost_equal(ttl_session1, 
> ttl_session2[0][0], error=0.005) """ error = kwargs['error'] if 'error' in 
> kwargs else 0.16 vmax = max(args) vmin = min(args) error_message = '' if 
> 'error_message' not in kwargs else kwargs['error_message'] assert vmin > vmax 
> * (1.0 - error) or vmin == vmax, \ > "values not within {:.2f}% of the max: 
> {} ({})".format(error * 100, args, error_message) E AssertionError: values 
> not within 10.00% of the max: (2534183, 2762123, 2423706) (node1) 
> tools/assertions.py:206: AssertionError
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



(cassandra-website) branch asf-staging updated (de724b94c -> 6fb0e7abe)

2023-11-30 Thread git-site-role
This is an automated email from the ASF dual-hosted git repository.

git-site-role pushed a change to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/cassandra-website.git


 discard de724b94c generate docs for 119ea2c4
 new 6fb0e7abe generate docs for 119ea2c4

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (de724b94c)
\
 N -- N -- N   refs/heads/asf-staging (6fb0e7abe)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 content/search-index.js |   2 +-
 site-ui/build/ui-bundle.zip | Bin 4883726 -> 4883726 bytes
 2 files changed, 1 insertion(+), 1 deletion(-)


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



(cassandra-website) branch asf-staging updated (4ac612577 -> de724b94c)

2023-11-30 Thread git-site-role
This is an automated email from the ASF dual-hosted git repository.

git-site-role pushed a change to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/cassandra-website.git


 discard 4ac612577 generate docs for 119ea2c4
 new de724b94c generate docs for 119ea2c4

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (4ac612577)
\
 N -- N -- N   refs/heads/asf-staging (de724b94c)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 site-ui/build/ui-bundle.zip | Bin 4883726 -> 4883726 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Runtian Liu (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791854#comment-17791854
 ] 

Runtian Liu commented on CASSANDRA-19120:
-

The draft looks good; I can create a dtest for this case.

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> For a two-DC cluster setup: when a new node is being added to DC1, a 
> blocking read repair triggered by LOCAL_QUORUM in DC1 will need to send the 
> read repair mutation to an extra node(1)(2). The selector for read repair 
> may pick *ANY* node that has not been contacted before(3) instead of 
> selecting a DC1 node. If a node from DC2 is selected, this causes a 100% 
> timeout because of the bug described below:
> When we initialize the latch(4) for blocking read repair, the shouldBlockOn 
> function only returns true for local nodes(5), and the blockFor value is 
> reduced if a local node doesn't require repair(6). blockFor equals the 
> number of read repair mutations sent out, but when the coordinator node 
> receives the responses from the target nodes, the latch only counts down for 
> nodes in the same DC(7). The latch waits until the timeout and the read 
> request times out.
> This can be reproduced with a constant load on a 3 + 3 cluster while adding 
> a node, given some way to trigger blocking read repair (for example by 
> adding load with the stress tool). If you use LOCAL_QUORUM consistency with 
> a constant read-after-write load in the same DC where the node is being 
> added, you will see read timeouts from time to time because of the bug 
> described above.
>  
> I think that when read repair selects the extra node to repair, we should 
> prefer local nodes over nodes from the other DC. Also, we need to fix the 
> latch so that even if we send the mutation to nodes in the other DC, we 
> don't get a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  
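A toy sketch of the counting mismatch described above, under the assumption that the number of outstanding acks is sized by every repair mutation sent while the ack handler only counts down for local-DC replicas. Names are hypothetical and this is not the BlockingPartitionRepair code itself:

{code:java}
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class BlockingRepairSketch
{
    record Replica(String endpoint, boolean localDc) {}

    public static void main(String[] args) throws InterruptedException
    {
        // Repair mutations sent to two local replicas plus one remote-DC replica picked as the "extra" node.
        List<Replica> targets = List.of(new Replica("dc1-node2", true),
                                        new Replica("dc1-node3", true),
                                        new Replica("dc2-node1", false));

        // The latch is sized by the number of mutations sent...
        CountDownLatch latch = new CountDownLatch(targets.size());

        // ...but acks only count down for local-DC replicas, mirroring the reported behaviour.
        for (Replica ack : targets)
            if (ack.localDc())
                latch.countDown();

        // One count can never be released, so the wait always runs into the timeout.
        boolean done = latch.await(100, TimeUnit.MILLISECONDS);
        System.out.println("repair completed before timeout: " + done); // false
    }
}
{code}

Either counting remote acks as well or never selecting a remote replica for the extra mutation would remove the guaranteed timeout, which matches the two fixes proposed in the description.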



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791853#comment-17791853
 ] 

Stefan Miklosovic commented on CASSANDRA-19120:
---

I am running a build here (1). Let's see what tests will fail on this to get 
an idea of what we are up against.

(1) 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/3603/workflows/3f5ec3b4-62c3-40e9-8734-88cf55d86327

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> For a two-DC cluster setup: when a new node is being added to DC1, a 
> blocking read repair triggered by LOCAL_QUORUM in DC1 will need to send the 
> read repair mutation to an extra node(1)(2). The selector for read repair 
> may pick *ANY* node that has not been contacted before(3) instead of 
> selecting a DC1 node. If a node from DC2 is selected, this causes a 100% 
> timeout because of the bug described below:
> When we initialize the latch(4) for blocking read repair, the shouldBlockOn 
> function only returns true for local nodes(5), and the blockFor value is 
> reduced if a local node doesn't require repair(6). blockFor equals the 
> number of read repair mutations sent out, but when the coordinator node 
> receives the responses from the target nodes, the latch only counts down for 
> nodes in the same DC(7). The latch waits until the timeout and the read 
> request times out.
> This can be reproduced with a constant load on a 3 + 3 cluster while adding 
> a node, given some way to trigger blocking read repair (for example by 
> adding load with the stress tool). If you use LOCAL_QUORUM consistency with 
> a constant read-after-write load in the same DC where the node is being 
> added, you will see read timeouts from time to time because of the bug 
> described above.
>  
> I think that when read repair selects the extra node to repair, we should 
> prefer local nodes over nodes from the other DC. Also, we need to fix the 
> latch so that even if we send the mutation to nodes in the other DC, we 
> don't get a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  
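
To make the mismatch above concrete, here is a minimal, self-contained toy model 
(hypothetical names, plain Java, not Cassandra code): blockFor mirrors every repair 
mutation sent, but only local-DC acknowledgements count the latch down, so a mutation 
sent to a remote-DC replica can never be acknowledged and the wait times out.
{code:java}
// Illustrative sketch only; all names are hypothetical.
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchMismatchDemo
{
    record Replica(String name, boolean localDc) {}

    public static void main(String[] args) throws InterruptedException
    {
        // Two repair mutations are sent: one to a local replica, one to a remote-DC replica.
        List<Replica> repairTargets = List.of(new Replica("dc1-node4", true),
                                              new Replica("dc2-node1", false));

        // blockFor mirrors the number of repair mutations sent out.
        CountDownLatch latch = new CountDownLatch(repairTargets.size());

        // Every target acknowledges, but (as in the bug) only local-DC acks
        // actually count the latch down.
        for (Replica replica : repairTargets)
        {
            if (replica.localDc())
                latch.countDown();
        }

        // The remote ack is never counted, so this wait times out.
        boolean completed = latch.await(1, TimeUnit.SECONDS);
        System.out.println("read repair completed in time: " + completed); // prints false
    }
}
{code}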



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791846#comment-17791846
 ] 

Jaydeepkumar Chovatia commented on CASSANDRA-19120:
---

The draft looks good to me. I will let [~curlylrt] take a look at it as well. 
Besides, we should add a JVM dtest or unit test for this fix. If you want 
[~curlylrt] to take care of it, let us know and we can write one.

 

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Consider a cluster with two DCs. When a new node is being added to DC1, a 
> blocking read repair triggered by LOCAL_QUORUM in DC1 is required to send the 
> read repair mutation to an extra node(1)(2). The selector for read repair may 
> pick *ANY* node that has not been contacted before(3) instead of picking a DC1 
> node. If a node from DC2 is selected, this causes a 100% timeout because of the 
> bug described below:
> When we initialize the latch(4) for blocking read repair, the shouldBlockOn 
> function only returns true for local nodes(5), and the blockFor value is 
> reduced if a local node doesn't require repair(6). blockFor equals the number 
> of read repair mutations sent out, but when the coordinator node receives the 
> responses from the target nodes, the latch only counts down for nodes in the 
> same DC(7). The latch therefore waits until the timeout and the read request 
> times out.
> This can be reproduced with a constant load on a 3 + 3 cluster while adding a 
> node, provided there is some way to trigger blocking read repair (for example 
> by adding load with the stress tool). With LOCAL_QUORUM consistency and a 
> constant read-after-write load in the same DC where the node is being added, 
> you will see read timeouts from time to time because of the bug described 
> above.
>  
> I think that when read repair selects the extra node to repair, it should 
> prefer local nodes over nodes from the other DC. We also need to fix the latch 
> handling so that even if we send the mutation to nodes in another DC, we don't 
> get a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19001) Check whether the startup warnings for unknown modules represent a legit problem or cosmetic issue

2023-11-30 Thread Ekaterina Dimitrova (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791845#comment-17791845
 ] 

Ekaterina Dimitrova commented on CASSANDRA-19001:
-

Thanks, [~paulo], appreciate it!
In the meantime I looked into the CI run in more detail:
- test_bootstrap_and_cleanup - this seems to need investigation; it appears to 
have been seen recently in CASSANDRA-18660. In that ticket the wait time for the 
message was 120 seconds, while here we see 90. I do not believe this is related 
to what we do here, but it seems worth opening a ticket and investigating. 
- test_decommissioned_wiped_node_can_join and 
test_shutdown_wiped_node_cannot_join failed with similar errors. They also do 
not look like something that could be triggered by this patch. I will open 
tickets and bisect. 

{code:java}

assert [Row(key=b'PP...e9\xbb'), ...] == [Row(key=b'5O...eL\xb6'), ...]
  At index 0 diff: Row(key=b'PP9O0M9170', 
C0=b'|\xa4\xc7\xb99\xd4\xae*\x85\x8c\xb3T\xa6!\x15\xaa{u\x90Rz\xc7J\x9a\xdd\x97b\xdd-\x07|\x06*\x06',
 C1=b"\x95g\x0f\xa2\x13Bha\xefW'\xf9\x
{code}
- test_move_single_node - seems unrelated and deserves a ticket:

{code:java}
failed on teardown with "Failed: Unexpected error found in node logs (see 
stdout for full details). Errors: [[node1] 'ERROR [main] 2023-11-30 
00:02:24,154 CassandraDaemon.java:877 - Fatal configuration 
error\norg.apache.cassandra.exceptions.ConfigurationException: Bootstrapping to 
existing token 0 is not allowed (decommission/removenode the old node 
first).\n\tat 
org.apache.cassandra.dht.BootStrapper.getSpecifiedTokens(BootStrapper.java:199)\n\tat
 
org.apache.cassandra.dht.BootStrapper.getBootstrapTokens(BootStrapper.java:167)\n\tat
 
org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:1256)\n\tat
 
org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:1209)\n\tat
 
org.apache.cassandra.service.StorageService.initServer(StorageService.java:988)\n\tat
 
org.apache.cassandra.service.StorageService.initServer(StorageService.java:905)\n\tat
 
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:377)\n\tat
 
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:721)\n\tat
 org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:855)']"
Unexpected error found in node logs (see stdout for full details). Errors: 
[[node1] 'ERROR [main] 2023-11-30 00:02:24,154 CassandraDaemon.java:877 - Fatal 
configuration error\norg.apache.cassandra.exceptions.ConfigurationException: 
Bootstrapping to existing token 0 is not allowed (decommission/removenode the 
old node first).\n\tat 
org.apache.cassandra.dht.BootStrapper.getSpecifiedTokens(BootStrapper.java:199)\n\tat
 
org.apache.cassandra.dht.BootStrapper.getBootstrapTokens(BootStrapper.java:167)\n\tat
 
org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:1256)\n\tat
 
org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:1209)\n\tat
 
org.apache.cassandra.service.StorageService.initServer(StorageService.java:988)\n\tat
 
org.apache.cassandra.service.StorageService.initServer(StorageService.java:905)\n\tat
 
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:377)\n\tat
 
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:721)\n\tat
 org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:855)']

{code}
- test_expiration_overflow_policy_cap_default_ttl - unrelated and seems to 
deserve a ticket; CC [~Bereng], who might be interested in the error we see 
(there is more in the logs, of course, but this is what stands out to me):

{code:java}
>   raise self._final_exception
E   cassandra.InvalidRequest: Error from server: code=2200 [Invalid 
query] message="Request on table ks.ttl_table with default ttl of 63072 
seconds exceeds maximum supported expiration date of 2038-01-19T03:14:06+00:00. 
In order to avoid this use a lower TTL, change the expiration date overflow 
policy or upgrade to a version where this limitation is fixed. See 
CASSANDRA-14092 and CASSANDRA-14227 for more details."
{code}

The rest seem to be either OS errors or timeouts. When I rebase at the end I 
will run a new full CI before commit, and that will confirm it.


> Check whether the startup warnings for unknown modules represent a legit 
> problem or cosmetic issue
> --
>
> Key: CASSANDRA-19001
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19001
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Other
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 5.0-r

Re: [PR] Cassandra 18852: Make bulk writer resilient to cluster resize events [cassandra-analytics]

2023-11-30 Thread via GitHub


arjunashok commented on code in PR #17:
URL: 
https://github.com/apache/cassandra-analytics/pull/17#discussion_r1411436951


##
cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/bulkwriter/RecordWriter.java:
##
@@ -136,35 +185,122 @@ public StreamResult write(Iterator> sourceIterato
 }
 }
 
+private Map, List> taskTokenRangeMapping(TokenRangeMapping tokenRange,
+                                         Range taskTokenRange)
+{
+    return tokenRange.getSubRanges(taskTokenRange).asMapOfRanges();
+}

+private Set instancesFromMapping(Map, List> mapping)
+{
+    return mapping.values()
+                  .stream()
+                  .flatMap(Collection::stream)
+                  .collect(Collectors.toSet());
+}

+/**
+ * Creates a new session if we have the current token range intersecting the ranges from write replica-set.
+ * If we do find the need to split a range into sub-ranges, we create the corresponding session for the sub-range
+ * if the token from the row data belongs to the range.
+ */
+private StreamSession maybeCreateStreamSession(TaskContext taskContext,
+                                               StreamSession streamSession,
+                                               Tuple2 rowData,
+                                               Set> newRanges,
+                                               ReplicaAwareFailureHandler failureHandler) throws IOException
+{
+    BigInteger token = rowData._1().getToken();
+    Range tokenRange = getTokenRange(taskContext);
+
+    Preconditions.checkState(tokenRange.contains(token),
+                             String.format("Received Token %s outside of expected range %s", token, tokenRange));
+
+    // token range for this partition is not among the write-replica-set ranges
+    if (!newRanges.contains(tokenRange))

Review Comment:
   Makes sense. Pulled this out of the loop.
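
For readers following the thread, a minimal sketch (hypothetical names; Guava Range on 
the classpath is assumed, as in the surrounding diff) of the hoisting being discussed: 
the task-range check is loop-invariant, so it can be evaluated once before iterating the 
rows rather than per row.
{code:java}
// Illustrative sketch only, not the bulk-writer code.
import com.google.common.collect.Range;

import java.math.BigInteger;
import java.util.List;
import java.util.Set;

public class HoistedRangeCheck
{
    public static void main(String[] args)
    {
        // Ranges this task is allowed to write to (computed once per task).
        Set<Range<BigInteger>> writableRanges =
                Set.of(Range.openClosed(BigInteger.ZERO, BigInteger.valueOf(100)),
                       Range.openClosed(BigInteger.valueOf(100), BigInteger.valueOf(200)));

        // The task's own token range does not change per row, so this containment
        // check is loop-invariant and is evaluated once, outside the row loop.
        Range<BigInteger> taskRange = Range.openClosed(BigInteger.ZERO, BigInteger.valueOf(100));
        boolean taskRangeWritable = writableRanges.contains(taskRange);

        // Inside the loop only per-row work remains.
        for (BigInteger token : List.of(BigInteger.valueOf(42), BigInteger.valueOf(99)))
        {
            System.out.println("token " + token + " writable: "
                               + (taskRangeWritable && taskRange.contains(token)));
        }
    }
}
{code}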



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



Re: [PR] Cassandra 18852: Make bulk writer resilient to cluster resize events [cassandra-analytics]

2023-11-30 Thread via GitHub


arjunashok commented on code in PR #17:
URL: 
https://github.com/apache/cassandra-analytics/pull/17#discussion_r1411436190


##
cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/bulkwriter/RecordWriter.java:
##
@@ -110,20 +132,47 @@ public StreamResult write(Iterator> sourceIterato
 Map valueMap = new HashMap<>();
 try
 {
+    List exclusions = failureHandler.getFailedInstances();
+    Set> newRanges = initialTokenRangeMapping.getRangeMap().asMapOfRanges().entrySet()
+                                              .stream()
+                                              .filter(e -> !exclusions.contains(e.getValue()))
+                                              .map(Map.Entry::getKey)
+                                              .collect(Collectors.toSet());
+
     while (dataIterator.hasNext())
     {
+        Tuple2 rowData = dataIterator.next();
+        streamSession = maybeCreateStreamSession(taskContext, streamSession, rowData, newRanges, failureHandler);
+
+        sessions.add(streamSession);
         maybeCreateTableWriter(partitionId, baseDir);
-        writeRow(valueMap, dataIterator, partitionId, range);
+        writeRow(rowData, valueMap, partitionId, streamSession.getTokenRange());
         checkBatchSize(streamSession, partitionId, job);
     }

-    if (sstableWriter != null)
+    // Finalize SSTable for the last StreamSession
+    if (sstableWriter != null || (streamSession != null && batchSize != 0))

Review Comment:
   Good point, removing redundant checks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



Re: [PR] Cassandra 18852: Make bulk writer resilient to cluster resize events [cassandra-analytics]

2023-11-30 Thread via GitHub


arjunashok commented on code in PR #17:
URL: 
https://github.com/apache/cassandra-analytics/pull/17#discussion_r1411436061


##
cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/bulkwriter/RecordWriter.java:
##
@@ -110,20 +132,47 @@ public StreamResult write(Iterator> sourceIterato
 Map valueMap = new HashMap<>();
 try
 {
+    List exclusions = failureHandler.getFailedInstances();
+    Set> newRanges = initialTokenRangeMapping.getRangeMap().asMapOfRanges().entrySet()
+                                              .stream()
+                                              .filter(e -> !exclusions.contains(e.getValue()))

Review Comment:
   Good catch. Evaluating if we even need this filter anymore. Will update.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791840#comment-17791840
 ] 

Stefan Miklosovic commented on CASSANDRA-19120:
---

Draft updated.

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Consider a cluster with two DCs. When a new node is being added to DC1, a 
> blocking read repair triggered by LOCAL_QUORUM in DC1 is required to send the 
> read repair mutation to an extra node(1)(2). The selector for read repair may 
> pick *ANY* node that has not been contacted before(3) instead of picking a DC1 
> node. If a node from DC2 is selected, this causes a 100% timeout because of the 
> bug described below:
> When we initialize the latch(4) for blocking read repair, the shouldBlockOn 
> function only returns true for local nodes(5), and the blockFor value is 
> reduced if a local node doesn't require repair(6). blockFor equals the number 
> of read repair mutations sent out, but when the coordinator node receives the 
> responses from the target nodes, the latch only counts down for nodes in the 
> same DC(7). The latch therefore waits until the timeout and the read request 
> times out.
> This can be reproduced with a constant load on a 3 + 3 cluster while adding a 
> node, provided there is some way to trigger blocking read repair (for example 
> by adding load with the stress tool). With LOCAL_QUORUM consistency and a 
> constant read-after-write load in the same DC where the node is being added, 
> you will see read timeouts from time to time because of the bug described 
> above.
>  
> I think that when read repair selects the extra node to repair, it should 
> prefer local nodes over nodes from the other DC. We also need to fix the latch 
> handling so that even if we send the mutation to nodes in another DC, we don't 
> get a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791837#comment-17791837
 ] 

Jaydeepkumar Chovatia commented on CASSANDRA-19120:
---

Sounds good [~smiklosovic] 

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Consider a cluster with two DCs. When a new node is being added to DC1, a 
> blocking read repair triggered by LOCAL_QUORUM in DC1 is required to send the 
> read repair mutation to an extra node(1)(2). The selector for read repair may 
> pick *ANY* node that has not been contacted before(3) instead of picking a DC1 
> node. If a node from DC2 is selected, this causes a 100% timeout because of the 
> bug described below:
> When we initialize the latch(4) for blocking read repair, the shouldBlockOn 
> function only returns true for local nodes(5), and the blockFor value is 
> reduced if a local node doesn't require repair(6). blockFor equals the number 
> of read repair mutations sent out, but when the coordinator node receives the 
> responses from the target nodes, the latch only counts down for nodes in the 
> same DC(7). The latch therefore waits until the timeout and the read request 
> times out.
> This can be reproduced with a constant load on a 3 + 3 cluster while adding a 
> node, provided there is some way to trigger blocking read repair (for example 
> by adding load with the stress tool). With LOCAL_QUORUM consistency and a 
> constant read-after-write load in the same DC where the node is being added, 
> you will see read timeouts from time to time because of the bug described 
> above.
>  
> I think that when read repair selects the extra node to repair, it should 
> prefer local nodes over nodes from the other DC. We also need to fix the latch 
> handling so that even if we send the mutation to nodes in another DC, we don't 
> get a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791834#comment-17791834
 ] 

Stefan Miklosovic commented on CASSANDRA-19120:
---

[~chovatia.jayd...@gmail.com]

I think we should wrap it in an "if" that checks whether we are on LOCAL_QUORUM, 
because if we do it exactly as I suggested, local replicas would also be 
preferred when we are on QUORUM, and I do not think that is desirable. 
Preferring local replicas when we are on LOCAL_QUORUM is probably just fine. 
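
To illustrate the shape of that guard, a self-contained sketch (simplified, hypothetical 
types; not the actual ReplicaPlans change) in which the local-DC preference only applies 
on LOCAL_QUORUM, so plain QUORUM keeps its current behaviour:
{code:java}
// Illustrative sketch only; all types and names are hypothetical.
import java.util.ArrayList;
import java.util.List;

public class ExtraContactSelectionSketch
{
    enum ConsistencyLevel { QUORUM, LOCAL_QUORUM }

    record Replica(String name, boolean inLocalDc) {}

    static List<Replica> chooseExtraContacts(ConsistencyLevel cl, List<Replica> live,
                                             List<Replica> contacts, int add)
    {
        List<Replica> chosen = new ArrayList<>();
        if (cl == ConsistencyLevel.LOCAL_QUORUM)
        {
            // Only for DC-local consistency: prefer not-yet-contacted local-DC replicas first.
            for (Replica r : live)
            {
                if (add == 0) break;
                if (!contacts.contains(r) && r.inLocalDc())
                {
                    chosen.add(r);
                    add--;
                }
            }
        }
        // Fall back to any remaining not-yet-contacted replica, as today.
        for (Replica r : live)
        {
            if (add == 0) break;
            if (!contacts.contains(r) && !chosen.contains(r))
            {
                chosen.add(r);
                add--;
            }
        }
        return chosen;
    }

    public static void main(String[] args)
    {
        List<Replica> live = List.of(new Replica("dc2-n1", false), new Replica("dc1-n4", true));
        List<Replica> contacts = List.of();
        // LOCAL_QUORUM picks the local replica; QUORUM keeps the existing "any node" behaviour.
        System.out.println(chooseExtraContacts(ConsistencyLevel.LOCAL_QUORUM, live, contacts, 1));
        System.out.println(chooseExtraContacts(ConsistencyLevel.QUORUM, live, contacts, 1));
    }
}
{code}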

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Consider a cluster with two DCs. When a new node is being added to DC1, a 
> blocking read repair triggered by LOCAL_QUORUM in DC1 is required to send the 
> read repair mutation to an extra node(1)(2). The selector for read repair may 
> pick *ANY* node that has not been contacted before(3) instead of picking a DC1 
> node. If a node from DC2 is selected, this causes a 100% timeout because of the 
> bug described below:
> When we initialize the latch(4) for blocking read repair, the shouldBlockOn 
> function only returns true for local nodes(5), and the blockFor value is 
> reduced if a local node doesn't require repair(6). blockFor equals the number 
> of read repair mutations sent out, but when the coordinator node receives the 
> responses from the target nodes, the latch only counts down for nodes in the 
> same DC(7). The latch therefore waits until the timeout and the read request 
> times out.
> This can be reproduced with a constant load on a 3 + 3 cluster while adding a 
> node, provided there is some way to trigger blocking read repair (for example 
> by adding load with the stress tool). With LOCAL_QUORUM consistency and a 
> constant read-after-write load in the same DC where the node is being added, 
> you will see read timeouts from time to time because of the bug described 
> above.
>  
> I think that when read repair selects the extra node to repair, it should 
> prefer local nodes over nodes from the other DC. We also need to fix the latch 
> handling so that even if we send the mutation to nodes in another DC, we don't 
> get a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791832#comment-17791832
 ] 

Jaydeepkumar Chovatia commented on CASSANDRA-19120:
---

[~smiklosovic], that works for us!

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Consider a cluster with two DCs. When a new node is being added to DC1, a 
> blocking read repair triggered by LOCAL_QUORUM in DC1 is required to send the 
> read repair mutation to an extra node(1)(2). The selector for read repair may 
> pick *ANY* node that has not been contacted before(3) instead of picking a DC1 
> node. If a node from DC2 is selected, this causes a 100% timeout because of the 
> bug described below:
> When we initialize the latch(4) for blocking read repair, the shouldBlockOn 
> function only returns true for local nodes(5), and the blockFor value is 
> reduced if a local node doesn't require repair(6). blockFor equals the number 
> of read repair mutations sent out, but when the coordinator node receives the 
> responses from the target nodes, the latch only counts down for nodes in the 
> same DC(7). The latch therefore waits until the timeout and the read request 
> times out.
> This can be reproduced with a constant load on a 3 + 3 cluster while adding a 
> node, provided there is some way to trigger blocking read repair (for example 
> by adding load with the stress tool). With LOCAL_QUORUM consistency and a 
> constant read-after-write load in the same DC where the node is being added, 
> you will see read timeouts from time to time because of the bug described 
> above.
>  
> I think that when read repair selects the extra node to repair, it should 
> prefer local nodes over nodes from the other DC. We also need to fix the latch 
> handling so that even if we send the mutation to nodes in another DC, we don't 
> get a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791831#comment-17791831
 ] 

Stefan Miklosovic commented on CASSANDRA-19120:
---

My idea is more like this:

https://github.com/apache/cassandra/pull/2953/files#diff-5f7abad30251655d9a9ba867a9d98d63e915941bf1b5acfe3b1def30a4137cc2

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Consider a cluster with two DCs. When a new node is being added to DC1, a 
> blocking read repair triggered by LOCAL_QUORUM in DC1 is required to send the 
> read repair mutation to an extra node(1)(2). The selector for read repair may 
> pick *ANY* node that has not been contacted before(3) instead of picking a DC1 
> node. If a node from DC2 is selected, this causes a 100% timeout because of the 
> bug described below:
> When we initialize the latch(4) for blocking read repair, the shouldBlockOn 
> function only returns true for local nodes(5), and the blockFor value is 
> reduced if a local node doesn't require repair(6). blockFor equals the number 
> of read repair mutations sent out, but when the coordinator node receives the 
> responses from the target nodes, the latch only counts down for nodes in the 
> same DC(7). The latch therefore waits until the timeout and the read request 
> times out.
> This can be reproduced with a constant load on a 3 + 3 cluster while adding a 
> node, provided there is some way to trigger blocking read repair (for example 
> by adding load with the stress tool). With LOCAL_QUORUM consistency and a 
> constant read-after-write load in the same DC where the node is being added, 
> you will see read timeouts from time to time because of the bug described 
> above.
>  
> I think that when read repair selects the extra node to repair, it should 
> prefer local nodes over nodes from the other DC. We also need to fix the latch 
> handling so that even if we send the mutation to nodes in another DC, we don't 
> get a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791827#comment-17791827
 ] 

Jaydeepkumar Chovatia edited comment on CASSANDRA-19120 at 11/30/23 11:34 PM:
--

Yeah, we should not completely filter out the remote nodes with LOCAL_QUORUM, as 
that would lead to other issues. We should simply *prioritize* the local nodes 
over remote nodes. Something like this could be done:
{code:java}
                 if (consistencyLevel != EACH_QUORUM)
                 {
                     int add = 
consistencyLevel.blockForWrite(liveAndDown.replicationStrategy(), 
liveAndDown.pending()) - contacts.size();
+
+                    if (consistencyLevel == LOCAL_QUORUM)
+                    {
+                        // prioritize local replicas first
+                        for (Replica replica : filter(live.all(), r -> 
!contacts.contains(r) && InOurDcTester.replicas().test(r)))
+                        {
+                            contacts.add(replica);
+                            if (--add == 0)
+                                break;
+                        }
+                    }
                     if (add > 0)
                     {
                         for (Replica replica : filter(live.all(), r -> 
!contacts.contains(r)))
.
. {code}


was (Author: chovatia.jayd...@gmail.com):
yeah, we should not completely filter the remote nodes with LOCAL_QUORUM as it 
would lead to other issues. We should simply *prioritize* the local nodes over 
remote nodes. Something like this can be done
{code:java}
                 if (consistencyLevel != EACH_QUORUM)
                 {
                     int add = 
consistencyLevel.blockForWrite(liveAndDown.replicationStrategy(), 
liveAndDown.pending()) - contacts.size();
+
+                    if (consistencyLevel == LOCAL_QUORUM)
+                    {
+                        // prioritize local replicas first
+                        for (Replica replica : filter(live.all(), r -> 
!contacts.contains(r) && InOurDcTester.replicas().test(r)))
+                        {
+                            contacts.add(replica);
+                            if (--add == 0)
+                                break;
+                        }
+                    }
                     if (add > 0)
                     {
                         for (Replica replica : filter(live.all(), r -> 
!contacts.contains(r)))
 {code}

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Consider a cluster with two DCs. When a new node is being added to DC1, a 
> blocking read repair triggered by LOCAL_QUORUM in DC1 is required to send the 
> read repair mutation to an extra node(1)(2). The selector for read repair may 
> pick *ANY* node that has not been contacted before(3) instead of picking a DC1 
> node. If a node from DC2 is selected, this causes a 100% timeout because of the 
> bug described below:
> When we initialize the latch(4) for blocking read repair, the shouldBlockOn 
> function only returns true for local nodes(5), and the blockFor value is 
> reduced if a local node doesn't require repair(6). blockFor equals the number 
> of read repair mutations sent out, but when the coordinator node receives the 
> responses from the target nodes, the latch only counts down for nodes in the 
> same DC(7). The latch therefore waits until the timeout and the read request 
> times out.
> This can be reproduced with a constant load on a 3 + 3 cluster while adding a 
> node, provided there is some way to trigger blocking read repair (for example 
> by adding load with the stress tool). With LOCAL_QUORUM consistency and a 
> constant read-after-write load in the same DC where the node is being added, 
> you will see read timeouts from time to time because of the bug described 
> above.
>  
> I think that when read repair selects the extra node to repair, it should 
> prefer local nodes over nodes from the other DC. We also need to fix the latch 
> handling so that even if we send the mutation to nodes in another DC, we don't 
> get a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)

[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791827#comment-17791827
 ] 

Jaydeepkumar Chovatia edited comment on CASSANDRA-19120 at 11/30/23 11:32 PM:
--

yeah, we should not completely filter the remote nodes with LOCAL_QUORUM as it 
would lead to other issues. We should simply *prioritize* the local nodes over 
remote nodes. Something like this can be done
{code:java}
                 if (consistencyLevel != EACH_QUORUM)
                 {
                     int add = 
consistencyLevel.blockForWrite(liveAndDown.replicationStrategy(), 
liveAndDown.pending()) - contacts.size();
+
+                    if (consistencyLevel == LOCAL_QUORUM)
+                    {
+                        // prioritize local replicas first
+                        for (Replica replica : filter(live.all(), r -> 
!contacts.contains(r) && InOurDcTester.replicas().test(r)))
+                        {
+                            contacts.add(replica);
+                            if (--add == 0)
+                                break;
+                        }
+                    }
                     if (add > 0)
                     {
                         for (Replica replica : filter(live.all(), r -> 
!contacts.contains(r)))
 {code}


was (Author: chovatia.jayd...@gmail.com):
yeah, we should not completely filter the remote nodes with LOCAL_QUORUM as it 
would lead to other issues. We should simply *prioritize* the local node over 
remote nodes. Something like this can be done
{code:java}
                 if (consistencyLevel != EACH_QUORUM)
                 {
                     int add = 
consistencyLevel.blockForWrite(liveAndDown.replicationStrategy(), 
liveAndDown.pending()) - contacts.size();
+
+                    if (consistencyLevel == LOCAL_QUORUM)
+                    {
+                        // prioritize local replicas first
+                        for (Replica replica : filter(live.all(), r -> 
!contacts.contains(r) && InOurDcTester.replicas().test(r)))
+                        {
+                            contacts.add(replica);
+                            if (--add == 0)
+                                break;
+                        }
+                    }
                     if (add > 0)
                     {
                         for (Replica replica : filter(live.all(), r -> 
!contacts.contains(r)))
 {code}

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Consider a cluster with two DCs. When a new node is being added to DC1, a 
> blocking read repair triggered by LOCAL_QUORUM in DC1 is required to send the 
> read repair mutation to an extra node(1)(2). The selector for read repair may 
> pick *ANY* node that has not been contacted before(3) instead of picking a DC1 
> node. If a node from DC2 is selected, this causes a 100% timeout because of the 
> bug described below:
> When we initialize the latch(4) for blocking read repair, the shouldBlockOn 
> function only returns true for local nodes(5), and the blockFor value is 
> reduced if a local node doesn't require repair(6). blockFor equals the number 
> of read repair mutations sent out, but when the coordinator node receives the 
> responses from the target nodes, the latch only counts down for nodes in the 
> same DC(7). The latch therefore waits until the timeout and the read request 
> times out.
> This can be reproduced with a constant load on a 3 + 3 cluster while adding a 
> node, provided there is some way to trigger blocking read repair (for example 
> by adding load with the stress tool). With LOCAL_QUORUM consistency and a 
> constant read-after-write load in the same DC where the node is being added, 
> you will see read timeouts from time to time because of the bug described 
> above.
>  
> I think that when read repair selects the extra node to repair, it should 
> prefer local nodes over nodes from the other DC. We also need to fix the latch 
> handling so that even if we send the mutation to nodes in another DC, we don't 
> get a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://git

[jira] [Commented] (CASSANDRA-19001) Check whether the startup warnings for unknown modules represent a legit problem or cosmetic issue

2023-11-30 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791828#comment-17791828
 ] 

Paulo Motta commented on CASSANDRA-19001:
-

[~e.dimitrova] thanks for the patch! I'll take a look ASAP, hopefully tomorrow.

> Check whether the startup warnings for unknown modules represent a legit 
> problem or cosmetic issue
> --
>
> Key: CASSANDRA-19001
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19001
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Other
>Reporter: Ekaterina Dimitrova
>Assignee: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 5.0-rc, 5.0.x, 5.x
>
>
> During the 5.0 alpha 2 release 
> [vote|https://lists.apache.org/thread/lt3x0obr5cpbcydf5490pj6b2q0mz5zr], 
> [~paulo] raised the following concerns:
> {code:java}
> Launched a tarball-based 5.0-alpha2 container on top of
> "eclipse-temurin:17-jre-focal" and the server starts up fine, can run
> nodetool and cqlsh.
> I got these seemingly harmless JDK17 warnings during startup and when
> running nodetool (no warnings on JDK11):
> WARNING: Unknown module: jdk.attach specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-exports
> WARNING: Unknown module: jdk.compiler specified to --add-opens
> WARNING: A terminally deprecated method in java.lang.System has been called
> WARNING: System::setSecurityManager has been called by
> org.apache.cassandra.security.ThreadAwareSecurityManager
> (file:/opt/cassandra/lib/apache-cassandra-5.0-alpha2-SNAPSHOT.jar)
> WARNING: Please consider reporting this to the maintainers of
> org.apache.cassandra.security.ThreadAwareSecurityManager
> WARNING: System::setSecurityManager will be removed in a future release
> Anybody knows if these warnings are legit/expected ? We can create
> follow-up tickets if needed.
> $ java --version
> openjdk 17.0.9 2023-10-17
> OpenJDK Runtime Environment Temurin-17.0.9+9 (build 17.0.9+9)
> OpenJDK 64-Bit Server VM Temurin-17.0.9+9 (build 17.0.9+9, mixed mode,
> sharing)
> {code}
> {code:java}
> Clarification: - When running nodetool only the "Unknown module" warnings 
> show up. All warnings show up during startup.{code}
> We need to verify whether this presents a real problem in the features where 
> those modules are expected to be used, or if it is a false alarm. 
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791827#comment-17791827
 ] 

Jaydeepkumar Chovatia commented on CASSANDRA-19120:
---

yeah, we should not completely filter the remote nodes with LOCAL_QUORUM as it 
would lead to other issues. We should simply *prioritize* the local node over 
remote nodes. Something like this can be done
{code:java}
                 if (consistencyLevel != EACH_QUORUM)
                 {
                     int add = 
consistencyLevel.blockForWrite(liveAndDown.replicationStrategy(), 
liveAndDown.pending()) - contacts.size();
+
+                    if (consistencyLevel == LOCAL_QUORUM)
+                    {
+                        // prioritize local replicas first
+                        for (Replica replica : filter(live.all(), r -> 
!contacts.contains(r) && InOurDcTester.replicas().test(r)))
+                        {
+                            contacts.add(replica);
+                            if (--add == 0)
+                                break;
+                        }
+                    }
                     if (add > 0)
                     {
                         for (Replica replica : filter(live.all(), r -> 
!contacts.contains(r)))
 {code}

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Consider a cluster with two DCs. When a new node is being added to DC1, a 
> blocking read repair triggered by LOCAL_QUORUM in DC1 is required to send the 
> read repair mutation to an extra node(1)(2). The selector for read repair may 
> pick *ANY* node that has not been contacted before(3) instead of picking a DC1 
> node. If a node from DC2 is selected, this causes a 100% timeout because of the 
> bug described below:
> When we initialize the latch(4) for blocking read repair, the shouldBlockOn 
> function only returns true for local nodes(5), and the blockFor value is 
> reduced if a local node doesn't require repair(6). blockFor equals the number 
> of read repair mutations sent out, but when the coordinator node receives the 
> responses from the target nodes, the latch only counts down for nodes in the 
> same DC(7). The latch therefore waits until the timeout and the read request 
> times out.
> This can be reproduced with a constant load on a 3 + 3 cluster while adding a 
> node, provided there is some way to trigger blocking read repair (for example 
> by adding load with the stress tool). With LOCAL_QUORUM consistency and a 
> constant read-after-write load in the same DC where the node is being added, 
> you will see read timeouts from time to time because of the bug described 
> above.
>  
> I think that when read repair selects the extra node to repair, it should 
> prefer local nodes over nodes from the other DC. We also need to fix the latch 
> handling so that even if we send the mutation to nodes in another DC, we don't 
> get a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



Re: [PR] CASSANDRA-19024 Fix bulk reading when using identifiers that need quotes [cassandra-analytics]

2023-11-30 Thread via GitHub


frankgh commented on code in PR #19:
URL: 
https://github.com/apache/cassandra-analytics/pull/19#discussion_r1411410934


##
cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/data/CassandraDataLayer.java:
##
@@ -950,204 +973,7 @@ protected void await(CountDownLatch latch)
 }
 }
 
-public static final class ClientConfig

Review Comment:
   moved this to its own class



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19054) WEBSITE - Add Cassandra Catalyst page and blog

2023-11-30 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-19054:
---
  Fix Version/s: NA
Source Control Link: 
https://github.com/apache/cassandra-website/commit/119ea2c4fbf5360b2cbed8b0c5c6790a5e3fec73
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

> WEBSITE - Add Cassandra Catalyst page and blog
> --
>
> Key: CASSANDRA-19054
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19054
> Project: Cassandra
>  Issue Type: Task
>  Components: Documentation/Blog, Documentation/Website
>Reporter: Diogenese Topper
>Priority: Normal
>  Labels: pull-request-available
> Fix For: NA
>
> Attachments: blogindex.png, navicon.png, 
> raw.githack.com_Paul-TT_cassandra-website_CASSANDRA-19054_generated_content___blog_Introducing-the-Apache-Cassandra-Catalyst-Program.html.png,
>  
> raw.githack.com_Paul-TT_cassandra-website_CASSANDRA-19054_generated_content___cassandra-catalyst-program.html.png
>
>
> This ticket is to capture the work associated with creating a new page for 
> the Cassandra Catalyst program and its associated blog.
>  
> Preferably, the blog and page are live as of *November 27.* Please contact 
> me, suggest changes, or correct the date when possible in the pull request 
> for the appropriate time that the blog will go live (on both the blog.adoc 
> and the blog post's file).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19054) WEBSITE - Add Cassandra Catalyst page and blog

2023-11-30 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791824#comment-17791824
 ] 

Michael Semb Wever commented on CASSANDRA-19054:


Committed 
https://github.com/apache/cassandra-website/commit/119ea2c4fbf5360b2cbed8b0c5c6790a5e3fec73
(will push live tomorrow)

> WEBSITE - Add Cassandra Catalyst page and blog
> --
>
> Key: CASSANDRA-19054
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19054
> Project: Cassandra
>  Issue Type: Task
>  Components: Documentation/Blog, Documentation/Website
>Reporter: Diogenese Topper
>Priority: Normal
>  Labels: pull-request-available
> Attachments: blogindex.png, navicon.png, 
> raw.githack.com_Paul-TT_cassandra-website_CASSANDRA-19054_generated_content___blog_Introducing-the-Apache-Cassandra-Catalyst-Program.html.png,
>  
> raw.githack.com_Paul-TT_cassandra-website_CASSANDRA-19054_generated_content___cassandra-catalyst-program.html.png
>
>
> This ticket is to capture the work associated with creating a new page for 
> the Cassandra Catalyst program and its associated blog.
>  
> Preferably, the blog and page are live as of *November 27.* Please contact 
> me, suggest changes, or correct the date when possible in the pull request 
> for the appropriate time that the blog will go live (on both the blog.adoc 
> and the blog post's file).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



(cassandra-website) branch trunk updated: Adding Catalyst page and blog post

2023-11-30 Thread mck
This is an automated email from the ASF dual-hosted git repository.

mck pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra-website.git


The following commit(s) were added to refs/heads/trunk by this push:
 new 119ea2c4f Adding Catalyst page and blog post
119ea2c4f is described below

commit 119ea2c4fbf5360b2cbed8b0c5c6790a5e3fec73
Author: Paul Thomas Au 
AuthorDate: Wed Nov 22 16:37:20 2023 -0800

Adding Catalyst page and blog post

 patch by Paul Thomas Au, Diogenese Topper, Melissa Logan; reviewed by Mick 
Semb Wever, Paulo Motta, Josh McKenzie for CASSANDRA-19054
---
 .../modules/ROOT/images/sub-menu-catalyst.png  | Bin 0 -> 4401 bytes
 site-content/source/modules/ROOT/pages/blog.adoc   |  22 
 ...cing-the-Apache-Cassandra-Catalyst-Program.adoc |  38 ++
 .../ROOT/pages/cassandra-catalyst-program.adoc | 137 +
 site-ui/build/ui-bundle.zip| Bin 4881412 -> 4883726 
bytes
 site-ui/src/css/tt_styles.css  |   4 +
 site-ui/src/img/sub-menu-catalyst.png  | Bin 0 -> 4401 bytes
 site-ui/src/partials/header-nav.hbs|  10 ++
 8 files changed, 211 insertions(+)

diff --git a/site-content/source/modules/ROOT/images/sub-menu-catalyst.png 
b/site-content/source/modules/ROOT/images/sub-menu-catalyst.png
new file mode 100644
index 0..6f10214b8
Binary files /dev/null and 
b/site-content/source/modules/ROOT/images/sub-menu-catalyst.png differ
diff --git a/site-content/source/modules/ROOT/pages/blog.adoc 
b/site-content/source/modules/ROOT/pages/blog.adoc
index fe9e499d7..1869ba195 100644
--- a/site-content/source/modules/ROOT/pages/blog.adoc
+++ b/site-content/source/modules/ROOT/pages/blog.adoc
@@ -8,6 +8,28 @@ NOTES FOR CONTENT CREATORS
- Replace post title, date, description and link to your post.
 
 
+//start card
+[openblock,card shadow relative test]
+
+[openblock,card-header]
+--
+[discrete]
+=== Introducing the Apache Cassandra® Catalyst Program
+[discrete]
+ December 1, 2023
+--
+[openblock,card-content]
+--
+Announcing the brand new Apache Cassandra® Catalyst Program, the first of its kind!
+[openblock,card-btn card-btn--blog]
+
+[.btn.btn--alt]
+xref:blog/Introducing-the-Apache-Cassandra-Catalyst-Program.adoc[Read More]
+
+
+--
+
+//end card
 
 //start card
 [openblock,card shadow relative test]
diff --git 
a/site-content/source/modules/ROOT/pages/blog/Introducing-the-Apache-Cassandra-Catalyst-Program.adoc
 
b/site-content/source/modules/ROOT/pages/blog/Introducing-the-Apache-Cassandra-Catalyst-Program.adoc
new file mode 100644
index 0..2b460cc6c
--- /dev/null
+++ 
b/site-content/source/modules/ROOT/pages/blog/Introducing-the-Apache-Cassandra-Catalyst-Program.adoc
@@ -0,0 +1,38 @@
+= Introducing the Apache Cassandra® Catalyst Program
+:page-layout: single-post
+:page-role: blog-post
+:page-post-date: December 1, 2023
+:page-post-author: Apache Cassandra PMC
+:description: announcement of the Apache Cassandra Catalyst program
+:keywords: 
+
+One of the cornerstones of https://www.apache.org/theapacheway/[The Apache 
Way^] is "community over code," the belief that the most sustainable and 
healthy projects are those that value a diverse and collaborative community. By 
working together, strong communities can rectify problems that arise during 
code development and can better evolve a project to meet technology demands. 
+
+Today we are excited to announce the 
xref:cassandra-catalyst-program.adoc[Apache Cassandra Catalyst program], an 
effort that aims to recognize individuals who invest in the growth of the 
Apache Cassandra community by enthusiastically sharing their expertise, 
encouraging participation, and creating a welcoming environment. This is the 
first PMC-led community program of its kind within the Apache Software 
Foundation (ASF) ecosystem, and we hope it inspires other ASF projects to find 
simila [...]
+
+Below you’ll find a couple of questions we feel are important to address right 
away, but for more information, you can visit the 
xref:cassandra-catalyst-program.adoc[Cassandra Catalyst Program page] and 
https://docs.google.com/forms/d/e/1FAIpQLScQ6FJZ9Z6Jpym0q1KUXzjnzEsHJvsmjQ3R6KEs6Qs8Jg7W_Q/viewform[nominate
 someone or apply^].
+
+**What does it mean to be a Cassandra Catalyst?** 
+
+Catalysts are trustworthy, expert contributors with a passion for connecting 
and empowering others with Cassandra knowledge. The individuals must be able to 
demonstrate strong knowledge of Cassandra such as production deployments, 
educational material, conference talks, or other ways. In broad terms, Catalyst 
can participate in two groups of activities:
+
+* **Contribution**: Engaging with the project and community in a myriad of 
ways including responding to community questions, welcoming new members; or 
engaging in JIRA tickets. 
+* **Promotion**: Telling others about Cassandra

[jira] [Updated] (CASSANDRA-19054) WEBSITE - Add Cassandra Catalyst page and blog

2023-11-30 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-19054:
---
Status: Ready to Commit  (was: Review In Progress)

> WEBSITE - Add Cassandra Catalyst page and blog
> --
>
> Key: CASSANDRA-19054
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19054
> Project: Cassandra
>  Issue Type: Task
>  Components: Documentation/Blog, Documentation/Website
>Reporter: Diogenese Topper
>Priority: Normal
>  Labels: pull-request-available
> Attachments: blogindex.png, navicon.png, 
> raw.githack.com_Paul-TT_cassandra-website_CASSANDRA-19054_generated_content___blog_Introducing-the-Apache-Cassandra-Catalyst-Program.html.png,
>  
> raw.githack.com_Paul-TT_cassandra-website_CASSANDRA-19054_generated_content___cassandra-catalyst-program.html.png
>
>
> This ticket is to capture the work associated with creating a new page for 
> the Cassandra Catalyst program and it's associated blog.
>  
> Preferably, the blog and page are live as of *November 27.* Please contact 
> me, suggest changes, or correct the date when possible in the pull request 
> for the appropriate time that the blog will go live (on both the blog.adoc 
> and the blog post's file).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19054) WEBSITE - Add Cassandra Catalyst page and blog

2023-11-30 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-19054:
---
Reviewers: Michael Semb Wever
   Status: Review In Progress  (was: Patch Available)

> WEBSITE - Add Cassandra Catalyst page and blog
> --
>
> Key: CASSANDRA-19054
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19054
> Project: Cassandra
>  Issue Type: Task
>  Components: Documentation/Blog, Documentation/Website
>Reporter: Diogenese Topper
>Priority: Normal
>  Labels: pull-request-available
> Attachments: blogindex.png, navicon.png, 
> raw.githack.com_Paul-TT_cassandra-website_CASSANDRA-19054_generated_content___blog_Introducing-the-Apache-Cassandra-Catalyst-Program.html.png,
>  
> raw.githack.com_Paul-TT_cassandra-website_CASSANDRA-19054_generated_content___cassandra-catalyst-program.html.png
>
>
> This ticket is to capture the work associated with creating a new page for 
> the Cassandra Catalyst program and it's associated blog.
>  
> Preferably, the blog and page are live as of *November 27.* Please contact 
> me, suggest changes, or correct the date when possible in the pull request 
> for the appropriate time that the blog will go live (on both the blog.adoc 
> and the blog post's file).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791821#comment-17791821
 ] 

Stefan Miklosovic edited comment on CASSANDRA-19120 at 11/30/23 11:13 PM:
--

Well that is not so simple. Yeah as you said, we might prioritize but we should 
not _exclude them_. E.g. if all live nodes are remote and we are on 
LOCAL_QUORUM, then with what I just sent we would never add them.

I'll play with this and let you know :)
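
For illustration, "prioritize but don't exclude" comes down to ordering the not-yet-contacted candidates instead of filtering them. A minimal, self-contained sketch (Candidate, ExtraNodePicker and prioritizeLocal are invented names for this example, not the real Replica / ReplicaPlans.Selector types):

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only: Candidate and ExtraNodePicker are invented names,
// not the real Replica / ReplicaPlans.Selector types.
final class ExtraNodePicker
{
    record Candidate(String endpoint, String dc) {}

    // Orders the not-yet-contacted candidates so replicas in the coordinator's DC
    // come first, while remote replicas stay eligible as a fallback.
    static List<Candidate> prioritizeLocal(List<Candidate> notContacted, String localDc)
    {
        List<Candidate> ordered = new ArrayList<>(notContacted);
        ordered.sort(Comparator.comparing((Candidate c) -> c.dc().equals(localDc) ? 0 : 1));
        return ordered;
    }

    public static void main(String[] args)
    {
        List<Candidate> candidates = List.of(new Candidate("10.0.2.1", "dc2"),
                                             new Candidate("10.0.1.3", "dc1"),
                                             new Candidate("10.0.2.2", "dc2"));
        // dc1 replica is tried first; dc2 replicas remain available if dc1 has none left.
        System.out.println(prioritizeLocal(candidates, "dc1"));
    }
}
{code}

Remote replicas are only reached for once the local DC has no remaining candidate, so the all-remote LOCAL_QUORUM case above keeps working.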


was (Author: smiklosovic):
Well that is not so simple. Yeah as you said, we might prioritize but we should 
not _exclude them_. E.g. if all live nodes are remote and we are on 
LOCAL_QUORUM then by what I just sent we would never add them.

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791821#comment-17791821
 ] 

Stefan Miklosovic edited comment on CASSANDRA-19120 at 11/30/23 11:12 PM:
--

Well that is not so simple. Yeah as you said, we might prioritize but we should 
not _exclude them_. E.g. if all live nodes are remote and we are on 
LOCAL_QUORUM then by what I just sent we would never add them.


was (Author: smiklosovic):
Well that is not so simple. Yeah as you siad, we might prioritize but we should 
not _exclude them_.

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791821#comment-17791821
 ] 

Stefan Miklosovic commented on CASSANDRA-19120:
---

Well that is not so simple. Yeah as you siad, we might prioritize but we should 
not _exclude them_.

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791820#comment-17791820
 ] 

Stefan Miklosovic commented on CASSANDRA-19120:
---

I see ... well, I don't know ... something like this?

https://github.com/apache/cassandra/pull/2953/commits/06069fac4f73882115c325722c2cc7bb9b779080

I don't see anything wrong with that special case but we are changing the 
behavior here and we should be careful. Let's test this idea first and then we 
can invite another committer to evaluate it as well.
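
The linked commit is not quoted here, but a guarded special case of that general shape, sketched with invented names rather than the actual ReplicaPlans.Selector API, could look roughly like this:

{code:java}
import java.util.List;
import java.util.Optional;

// Sketch only, with invented names; not the actual patch from the pull request above.
final class DcLocalSelectorSketch
{
    record Replica(String endpoint, String dc) {}

    // Picks the extra read-repair target, preferring the local DC for DC-local
    // consistency levels but falling back to any node when no local candidate exists.
    static Optional<Replica> pickExtra(boolean dcLocalConsistency, List<Replica> notContacted, String localDc)
    {
        if (dcLocalConsistency)
        {
            Optional<Replica> local = notContacted.stream()
                                                  .filter(r -> r.dc().equals(localDc))
                                                  .findFirst();
            if (local.isPresent())
                return local; // special case: stay in the local DC when we can
        }
        return notContacted.stream().findFirst(); // unchanged "any node" behaviour otherwise
    }

    public static void main(String[] args)
    {
        List<Replica> notContacted = List.of(new Replica("10.0.2.1", "dc2"), new Replica("10.0.1.4", "dc1"));
        System.out.println(pickExtra(true, notContacted, "dc1"));  // prefers the dc1 replica
        System.out.println(pickExtra(false, notContacted, "dc1")); // plain "first available" behaviour
    }
}
{code}

The important bit is the fallback: when the local DC has no not-yet-contacted replica, behaviour stays exactly as it is today.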



> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791818#comment-17791818
 ] 

Jaydeepkumar Chovatia edited comment on CASSANDRA-19120 at 11/30/23 11:00 PM:
--

For {_}QUORUM{_}, it is okay to have remote nodes, but for {_}LOCAL_QUORUM{_}, 
it is better to prioritize local nodes over remote nodes. We can extend the 
selector code [1] and add a special case for LOCAL_QUORUM to prioritize local 
node selection first because in a multi-region setup, LOCAL_QUORUM is the most 
commonly used consistency level.

[1] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L572]

 


was (Author: chovatia.jayd...@gmail.com):
For //{_}QUORUM{_}, it is okay to have remote nodes, but for 
{_}LOCAL_QUORUM{_}, it is better to prioritize local nodes over remote nodes. 
We can extend the selector code [1] and add a special case for LOCAL_QUORUM to 
prioritize local node selection first because in a multi-region setup, 
LOCAL_QUORUM is the most commonly used replication factor.

[1] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L572]

 

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791818#comment-17791818
 ] 

Jaydeepkumar Chovatia edited comment on CASSANDRA-19120 at 11/30/23 10:58 PM:
--

For //{_}QUORUM{_}, it is okay to have remote nodes, but for 
{_}LOCAL_QUORUM{_}, it is better to prioritize local nodes over remote nodes. 
We can extend the selector code [1] and add a special case for LOCAL_QUORUM to 
prioritize local node selection first because in a multi-region setup, 
LOCAL_QUORUM is the most commonly used replication factor.

[1] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L572]

 


was (Author: chovatia.jayd...@gmail.com):
For //QUORUM//, it is okay to have remote nodes, but for LOCAL_QUORUM, it is 
better to prioritize local nodes over remote nodes. We can extend the selector 
code [1] and add a special case for LOCAL_QUORUM to prioritize local node 
selection first because in a multi-region setup, LOCAL_QUORUM is the most 
commonly used replication factor.

[1] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L572]

 

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791818#comment-17791818
 ] 

Jaydeepkumar Chovatia commented on CASSANDRA-19120:
---

For //QUORUM//, it is okay to have remote nodes, but for LOCAL_QUORUM, it is 
better to prioritize local nodes over remote nodes. We can extend the selector 
code [1] and add a special case for LOCAL_QUORUM to prioritize local node 
selection first because in a multi-region setup, LOCAL_QUORUM is the most 
commonly used replication factor.

[1] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L572]

 

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791816#comment-17791816
 ] 

Stefan Miklosovic edited comment on CASSANDRA-19120 at 11/30/23 10:55 PM:
--

The whole research we did was with LOCAL_QUORUM in mind. If we are not on 
EACH_QUORUM (1), then the selector logic will also be invoked when we are on 
QUORUM, right? Then selecting replicas in remote DCs to participate in read 
repair is perfectly fine, no? We just should not block for them in 
BlockingPartitionRepair. 

(1) 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L570
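
As a toy model of "send to everyone, block only on the local DC", using a plain CountDownLatch and invented names rather than the real BlockingPartitionRepair internals:

{code:java}
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Toy model of "send to everyone, block only on the local DC"; Target, repairAndWait
// and the immediate countDown() are stand-ins, not the real BlockingPartitionRepair.
final class BlockOnLocalOnlySketch
{
    record Target(String endpoint, String dc) {}

    static boolean repairAndWait(List<Target> targets, String localDc, long timeoutMs) throws InterruptedException
    {
        long blockFor = targets.stream().filter(t -> t.dc().equals(localDc)).count();
        CountDownLatch latch = new CountDownLatch((int) blockFor);

        for (Target t : targets)
        {
            // a repair mutation would be sent to every target here, local or remote...
            if (t.dc().equals(localDc))
                latch.countDown(); // ...but only local acks are awaited (simulated as instant acks)
        }
        return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException
    {
        List<Target> targets = List.of(new Target("10.0.1.2", "dc1"), new Target("10.0.2.2", "dc2"));
        System.out.println("completed before timeout: " + repairAndWait(targets, "dc1", 100));
    }
}
{code}

Because blockFor is derived from the local targets only, a repair mutation sent to a remote DC can never leave the latch waiting for an ack that will not be counted.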


was (Author: smiklosovic):
Whole research we did was with LOCAL_QUORUM in mind. If we are not on 
EACH_QUORUM (1), then the selector logic will be also invoked when we are on 
QUORUM, right? Then, selecting replicas in remote DCs to participate in read 
repair is perfectly fine, no? We just should not block from them in 
BlockingPartitionRepair. 

(1) 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L570

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791816#comment-17791816
 ] 

Stefan Miklosovic commented on CASSANDRA-19120:
---

Whole research we did was with LOCAL_QUORUM in mind. If we are not on 
EACH_QUORUM (1), then the selector logic will be also invoked when we are on 
QUORUM, right? Then, selecting replicas in remote DCs to participate in read 
repair is perfectly fine, no? We just should not block from them in 
BlockingPartitionRepair. 

(1) 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L570

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791814#comment-17791814
 ] 

Jaydeepkumar Chovatia edited comment on CASSANDRA-19120 at 11/30/23 10:50 PM:
--

[~smiklosovic]

 

[~curlylrt] and I are also suggesting enhancing the _Selector_ [2] code and not 
selecting the remote replicas in the first place if the user-specified 
consistency level is LOCAL_*. So, in short, two fixes we are recommending:
 # _adjustedBlockFor_ should count only the local replicas [1] - 
{color:#de350b}*must*{color}
 # Improve the selector not to select the remote replicas in the first place 
[2] - *{color:#ff8b00}nice to have{color}*

[1] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]

[2][https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L577]

 

wdyt?
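
The mismatch behind fix #1 is easy to see with a plain CountDownLatch and hypothetical counts (one local ack arrives, the remote ack never counts down); this is not the real BlockingPartitionRepair code:

{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Demonstrates the timeout when blockFor includes a remote target
// whose ack never counts the latch down; the numbers are hypothetical.
public final class BlockForMismatchDemo
{
    public static void main(String[] args) throws InterruptedException
    {
        int localAcks = 1;          // only same-DC responses count down today
        int blockForAllTargets = 2; // one local + one remote mutation sent
        int blockForLocalOnly = 1;  // fix #1: count only the local replicas

        CountDownLatch buggy = new CountDownLatch(blockForAllTargets);
        CountDownLatch fixed = new CountDownLatch(blockForLocalOnly);
        for (int i = 0; i < localAcks; i++)
        {
            buggy.countDown();
            fixed.countDown();
        }
        System.out.println("blockFor over all targets -> " + buggy.await(100, TimeUnit.MILLISECONDS)); // false: times out
        System.out.println("blockFor over local only  -> " + fixed.await(100, TimeUnit.MILLISECONDS)); // true: completes
    }
}
{code}

With blockFor derived from all targets the await can only time out; deriving it from the local replicas only, as suggested above, removes the mismatch.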

 


was (Author: chovatia.jayd...@gmail.com):
[~smiklosovic]

 

[~curlylrt] and I are also suggesting enhancing the _Selector_ [2] code and not 
selecting the remote replicas in the first place if the user-specified 
consistency level is {_}LOCAL{_}*. So, in short, two fixes we are recommending:
 # _adjustedBlockFor_ should count only the local replicas in [1] - 
{color:#de350b}*must*{color}
 # Improve the selector not to select the remote replicas in the first place 
[2]- *{color:#ff8b00}nice to have{color}*

[1] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]

[2][https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L577]

 

wdyt?

 

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791814#comment-17791814
 ] 

Jaydeepkumar Chovatia edited comment on CASSANDRA-19120 at 11/30/23 10:48 PM:
--

[~smiklosovic]

 

[~curlylrt] and I are also suggesting enhancing the _Selector_ [2] code and not 
selecting the remote replicas in the first place if the user-specified 
consistency level is {_}LOCAL{_}*. So, in short, two fixes we are recommending:
 # _adjustedBlockFor_ should count only the local replicas in [1]
 # Improve the selector not to select the remote replicas in the first place [2]

[1] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]

[2][https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L577]

 

wdyt?

 


was (Author: chovatia.jayd...@gmail.com):
[~smiklosovic]

 

[~curlylrt] and I are also suggesting enhancing the _Selector_ [2] code and not 
selecting the remote replicas in the first place if the user-specified 
consistency level is {_}LOCAL_*{_}. So, in short, two fixes we are recommending:
 # _adjustedBlockFor_ should count only the local replicas in [1]
 # Improve the selector not to select the remote replicas in the first place [2]

[1] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]

[2][https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L577]

 

wdyt?

 

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org

[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791814#comment-17791814
 ] 

Jaydeepkumar Chovatia edited comment on CASSANDRA-19120 at 11/30/23 10:48 PM:
--

[~smiklosovic]

 

[~curlylrt] and I are also suggesting enhancing the _Selector_ [2] code and not 
selecting the remote replicas in the first place if the user-specified 
consistency level is {_}LOCAL{_}*. So, in short, two fixes we are recommending:
 # _adjustedBlockFor_ should count only the local replicas in [1] - 
{color:#de350b}*must*{color}
 # Improve the selector not to select the remote replicas in the first place 
[2]- *{color:#ff8b00}nice to have{color}*

[1] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]

[2][https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L577]

 

wdyt?

 


was (Author: chovatia.jayd...@gmail.com):
[~smiklosovic]

 

[~curlylrt] and I are also suggesting enhancing the _Selector_ [2] code and not 
selecting the remote replicas in the first place if the user-specified 
consistency level is {_}LOCAL{_}*. So, in short, two fixes we are recommending:
 # _adjustedBlockFor_ should count only the local replicas in [1]
 # Improve the selector not to select the remote replicas in the first place [2]

[1] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]

[2][https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L577]

 

wdyt?

 

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

--

[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791814#comment-17791814
 ] 

Jaydeepkumar Chovatia edited comment on CASSANDRA-19120 at 11/30/23 10:48 PM:
--

[~smiklosovic]

 

[~curlylrt] and I are also suggesting enhancing the _Selector_ [2] code so that it does not 
select the remote replicas in the first place when the user-specified 
consistency level is {_}LOCAL_*{_}. So, in short, we are recommending two fixes:
 # _adjustedBlockFor_ should count only the local replicas in [1]
 # Improve the selector not to select the remote replicas in the first place [2]

[1] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]

[2][https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L577]

 

wdyt?

 


was (Author: chovatia.jayd...@gmail.com):
[~smiklosovic]

 

[~curlylrt] and I are also suggesting enhancing the _Selector_ [2] code and not 
selecting the remote replicas in the first place if the user-specified 
consistency level is {_}LOCAL_{_}. So, in short, two fixes we are recommending:
 # _adjustedBlockFor_ should count only the local replicas in [1]
 # Improve the selector not to select the remote replicas in the first place [2]

[1] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]

[2][https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L577]

 

wdyt?

 

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.

[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791814#comment-17791814
 ] 

Jaydeepkumar Chovatia commented on CASSANDRA-19120:
---

[~smiklosovic]

 

[~curlylrt] and I are also suggesting enhancing the _Selector_ [2] code so that it does not 
select the remote replicas in the first place when the user-specified 
consistency level is {_}LOCAL_{_}. So, in short, we are recommending two fixes:
 # _adjustedBlockFor_ should count only the local replicas in [1]
 # Improve the selector not to select the remote replicas in the first place [2]

[1] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]

[2][https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L577]

 

wdyt?
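
To make fix #2 concrete, below is a small, self-contained sketch (plain Java; this is *not* the actual _ReplicaPlans_/Selector API — the _Replica_ record, _pickExtra_ and the DC names are made up for illustration). The idea: when the request uses a LOCAL_* consistency level, the extra uncontacted node chosen for read repair should be restricted to the coordinator's DC.

{code:java}
import java.util.*;

// Self-contained sketch (not Cassandra's actual Selector code): pick the extra,
// previously uncontacted replica for read repair, and never pick a remote node
// when the consistency level is DC-local.
public class ExtraReplicaSelectionSketch
{
    record Replica(String endpoint, String dc) {}

    static Optional<Replica> pickExtra(List<Replica> live, Set<Replica> contacts,
                                       String localDc, boolean dcLocalConsistency)
    {
        for (Replica r : live)
        {
            if (contacts.contains(r))
                continue;                      // already contacted
            if (dcLocalConsistency && !r.dc().equals(localDc))
                continue;                      // fix #2: skip remote replicas for LOCAL_*
            return Optional.of(r);
        }
        return Optional.empty();
    }

    public static void main(String[] args)
    {
        List<Replica> live = List.of(new Replica("L1", "DC1"), new Replica("L2", "DC1"),
                                     new Replica("R1", "DC2"));
        Set<Replica> contacts = Set.of(live.get(0), live.get(1)); // L1, L2 already contacted
        // Current behaviour (no DC restriction) can pick R1; the proposed behaviour never does.
        System.out.println(pickExtra(live, contacts, "DC1", false)); // Optional[Replica[endpoint=R1, dc=DC2]]
        System.out.println(pickExtra(live, contacts, "DC1", true));  // Optional.empty
    }
}
{code}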

 

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791813#comment-17791813
 ] 

Stefan Miklosovic commented on CASSANDRA-19120:
---

[~chovatia.jayd...@gmail.com] cool, I will run some tests here. Maybe 
[~curlylrt] could test this on his end in the meantime to verify our logic?

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791812#comment-17791812
 ] 

Stefan Miklosovic edited comment on CASSANDRA-19120 at 11/30/23 10:35 PM:
--

The whole logic will "end" when the latch is 0. Correct? If "from" in ack() is remote, 
then it will not enter that "if", so the latch will not be decremented. But that is all 
fine, as long as (adjusted)blockFor only counts local replicas. 
Basically, we need to set blockFor low enough to reach 0 in ack(). That will be 
done when we negate the shouldBlockOn method in that "if" to exclude remote 
replicas, because remote replicas are not meant to be waited for.


was (Author: smiklosovic):
Whole logic will "end" when latch is 0. Correct? if "from" is remote, then it 
will not enter that "if" so it will not be decremented. But that is all fine, 
as long as (adjusted)BlockFor contains only blocks for local replicas. 
Basically we need to set blockFor low enough to get to 0 in ack(). That will be 
done when we negate shouldBlockFor method in that "if" to exclude remote 
replicas because remote replicas are not meant to be waited for.

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791812#comment-17791812
 ] 

Stefan Miklosovic commented on CASSANDRA-19120:
---

The whole logic will "end" when the latch is 0. Correct? If "from" is remote, then it 
will not enter that "if", so the latch will not be decremented. But that is all fine, 
as long as (adjusted)blockFor only counts local replicas. 
Basically, we need to set blockFor low enough to reach 0 in ack(). That will be 
done when we negate the shouldBlockOn method in that "if" to exclude remote 
replicas, because remote replicas are not meant to be waited for.

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791811#comment-17791811
 ] 

Jaydeepkumar Chovatia commented on CASSANDRA-19120:
---

[~smiklosovic] 

> So it seems to me that we should do this:

The fix you recommended won't work: if the remote node is a participant, 
the condition will always be _false_ because of the `&&` operator in the _if_ 
statement. It should be an `||` operator. A slight modification will work:
{code:java}
if (!repairs.containsKey(participant) || !shouldBlockOn.test(participant.endpoint()))
    adjustedBlockFor--;
{code}
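
For illustration, here is a tiny self-contained sketch (not the real BlockingPartitionRepair code; the participants L1, L2 (local) and R1 (remote) and the predicate are hypothetical, matching the example discussed on this ticket) comparing the `&&` and `||` variants of that condition. With `&&`, a remote participant never reduces _adjustedBlockFor_; with `||`, it does.

{code:java}
import java.util.*;
import java.util.function.Predicate;

// Sketch only: participants are plain strings, "repairs" is the set of nodes a
// repair mutation is sent to, and shouldBlockOn is true only for local nodes.
public class AdjustedBlockForSketch
{
    public static void main(String[] args)
    {
        List<String> repairPlan = List.of("L1", "L2", "R1");   // two local nodes + one remote
        Set<String> repairs = Set.of("L1", "R1");              // stale nodes that receive a repair mutation
        Predicate<String> shouldBlockOn = node -> node.startsWith("L");

        int withAnd = 3, withOr = 3;                           // repairPlan.writeQuorum() == 3 in the example
        for (String participant : repairPlan)
        {
            if (!repairs.contains(participant) && shouldBlockOn.test(participant))
                withAnd--;                                     // current code: only L2 qualifies -> 2
            if (!repairs.contains(participant) || !shouldBlockOn.test(participant))
                withOr--;                                      // proposed fix: L2 and R1 qualify -> 1
        }
        // Only L1's ack can ever decrement the latch, so blockFor must end up at 1, not 2.
        System.out.println("adjustedBlockFor with && = " + withAnd + ", with || = " + withOr);
    }
}
{code}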

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Runtian Liu (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791810#comment-17791810
 ] 

Runtian Liu commented on CASSANDRA-19120:
-

Anyway, that does not matter. Whole post can be read with pending being local. 
It is same thing.

It is the combination of this adding remote replicas
{code:java}
for (Replica replica : filter(live.all(), r -> 
!contacts.contains(r)))
{
contacts.add(replica);
if (--add == 0)
break;
}
{code}
and this
{code:java}
if (!repairs.containsKey(participant) && 
shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
{code}
not decrementing when participant is remote.

If we have a participant which is remote, we will have a problem when ack() is called:
{code:java}
void ack(InetAddressAndPort from)
    {
        if (shouldBlockOn.test(from))
        {
            pendingRepairs.remove(repairPlan.lookup(from));
            latch.decrement();
        }
    } {code}
The remote node's response won't count the latch down.

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791756#comment-17791756
 ] 

Jaydeepkumar Chovatia edited comment on CASSANDRA-19120 at 11/30/23 10:20 PM:
--

Sure.

Say we are reading with LOCAL_QUORUM with a replication factor of 3. Now, let's 
understand the input parameters for the following: 

 
{code:java}
public BlockingPartitionRepair(DecoratedKey key, Map<Replica, Mutation> repairs, 
ReplicaPlan.ForWrite repairPlan, Predicate<InetAddressAndPort> shouldBlockOn) {code}
 * {+}_repairPlan_{+}: This could have two local nodes, _L1_ and _L2_, and one remote 
node, _R1_, since the _Selector_ [1] could select the remote node, as [~curlylrt] 
mentioned above. So the _repairPlan_ parameter has {_}L1, L2, R1{_}.
 * {+}_repairs_{+}: This also has _L1_ and _R1_ because these two might be 
stale and require an update. L2 is not included as it has the latest data.
 * Now, at the following line, _pendingRepairs_ will have two nodes, {_}L1, R1{_}:
{code:java}
this.pendingRepairs = new ConcurrentHashMap<>(repairs); {code}

 * So _adjustedBlockFor_ will be set to 3 at the following line:
{code:java}
int adjustedBlockFor = repairPlan.writeQuorum(); {code}

 * Let's analyse L1, L2, and R1 for the following condition: 
{code:java}
if (!repairs.containsKey(participant) && 
shouldBlockOn.test(participant.endpoint())) {code}

 * 
 ** L1: _false_ as it is a participant
 ** L2: _true_ as it is not a participant and local as well 
//{_}adjustedBlockFor{_} 3-->2
 ** R1: _false_ as it is a participant
 * The following lines set _blockFor_ and the _latch_ to the value 2:
{code:java}
this.blockFor = adjustedBlockFor; 
...
latch = newCountDownLatch(Math.max(blockFor, 0));{code}

 * But when we receive a response from {_}R1{_}, line 122 [2] 
{color:#de350b}excludes{color} the response, and as a result the latch does not 
decrement:
{code:java}
void ack(InetAddressAndPort from)
    {
        if (shouldBlockOn.test(from))
        {
            pendingRepairs.remove(repairPlan.lookup(from));
            latch.decrement();
        }
    } {code}

 

[1] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L411]

[2] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L122C9-L122C9]
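
Putting the numbers above together, here is a minimal, self-contained simulation of the scenario (plain Java; a simple counter stands in for the real CountDownLatch, and L1/L2/R1 plus the local-DC predicate are the assumptions from the example, not Cassandra's actual classes):

{code:java}
import java.util.*;
import java.util.function.Predicate;

// Sketch of the scenario above: blockFor ends up at 2, but only one ack (L1's)
// can ever decrement the latch, so it never reaches 0 and the read times out.
public class BlockingRepairTimeoutSketch
{
    public static void main(String[] args)
    {
        List<String> repairPlan = List.of("L1", "L2", "R1");      // writeQuorum() == 3
        Set<String> repairs = Set.of("L1", "R1");                 // repair mutations actually sent
        Predicate<String> shouldBlockOn = n -> n.startsWith("L"); // local-DC nodes only

        int adjustedBlockFor = 3;
        for (String participant : repairPlan)
            if (!repairs.contains(participant) && shouldBlockOn.test(participant))
                adjustedBlockFor--;                               // only L2 -> blockFor = 2

        int latch = Math.max(adjustedBlockFor, 0);                // newCountDownLatch(2)

        for (String from : repairs)                               // both repaired nodes respond
            if (shouldBlockOn.test(from))
                latch--;                                          // L1 counts down, R1 is ignored

        System.out.println("blockFor = " + adjustedBlockFor + ", latch after all acks = " + latch);
        // latch is still 1 -> the coordinator waits until the timeout and the read fails
    }
}
{code}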

 

 


was (Author: chovatia.jayd...@gmail.com):
Sure.

Say we are reading with LOCAL_QUORUM with a replication factor of 3. Now, let's 
understand the input parameters for the following: 

 
{code:java}
public BlockingPartitionRepair(DecoratedKey key, Map 
repairs, ReplicaPlan.ForWrite repairPlan, Predicate 
shouldBlockOn) {code}
 * {+}_repairPlan_{+}: This could have one local node _L1_ and one remote node 
_R1_ as the _Selector_ [1] could select the remote node as [~curlylrt] 
mentioned above. So _repairPlan_ parameter has {{_}L1, R1{_}}
 * {+}_repairs_{+}: This also has _L1_ and _R1_ because these two might be 
stale and require an update.
 * Now, at the following line, _pendingRepairs_ will have two nodes {{_}L1, 
R1{_}}.  
{code:java}
this.pendingRepairs = new ConcurrentHashMap<>(repairs); {code}

 *  so _adjustedBlockFor_ will be set to 2 at the following line:
{code:java}
int adjustedBlockFor = repairPlan.writeQuorum(); {code}

 * This if condition will be _false_ always even for _R1_ because _repairs_ 
include {_}R1{_}, as a result, _adjustedBlockFor_ remains unchanged, i.e., _2_
{code:java}
if (!repairs.containsKey(participant) && 
shouldBlockOn.test(participant.endpoint())) {code}

 * The following lines sets _blockFor_ and _latch_ to the value 2 
{code:java}
this.blockFor = adjustedBlockFor; 
...
latch = newCountDownLatch(Math.max(blockFor, 0));{code}

 * But when we receive a response from {_}R1{_}, then line 122 [2] 
{color:#de350b}excludes{color} the response, as a result, the latch does not 
decrement 
{code:java}
void ack(InetAddressAndPort from)
    {
        if (shouldBlockOn.test(from))
        {
            pendingRepairs.remove(repairPlan.lookup(from));
            latch.decrement();
        }
    } {code}

 

[1] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L411]

[2] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L122C9-L122C9]

 

 

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-2

[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791808#comment-17791808
 ] 

Jaydeepkumar Chovatia commented on CASSANDRA-19120:
---

Yeah, [~smiklosovic] and [~curlylrt], _adjustedBlockFor_ would be 3 instead of 
2.

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791807#comment-17791807
 ] 

Stefan Miklosovic edited comment on CASSANDRA-19120 at 11/30/23 10:13 PM:
--

Anyway, that does not matter. Whole post can be read with pending being local. 
It is same thing.

It is the combination of this adding remote replicas

{code}
for (Replica replica : filter(live.all(), r -> 
!contacts.contains(r)))
{
contacts.add(replica);
if (--add == 0)
break;
}
{code}

and this

{code}
if (!repairs.containsKey(participant) && 
shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
{code}

not decrementing when participant is remote.


was (Author: smiklosovic):
Anyway, that does not matter. Whole post can be read with pending being local. 
It is same thing.

It is the combination of this adding remote replicas

{code}
for (Replica replica : filter(live.all(), r -> 
!contacts.contains(r)))
{
contacts.add(replica);
if (--add == 0)
break;
}
{code}

and this

{code}
if (!repairs.containsKey(participant) && 
shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
{code}

not decementing when participant is remote.

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: com

[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791807#comment-17791807
 ] 

Stefan Miklosovic edited comment on CASSANDRA-19120 at 11/30/23 10:11 PM:
--

Anyway, that does not matter. Whole post can be read with pending being local. 
It is same thing.

It it the combination of this adding remote replicas

{code}
for (Replica replica : filter(live.all(), r -> 
!contacts.contains(r)))
{
contacts.add(replica);
if (--add == 0)
break;
}
{code}

and this

{code}
if (!repairs.containsKey(participant) && 
shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
{code}

not decementing when participant is remote.


was (Author: smiklosovic):
Anyway, that does not matter. Whole post can be read with pending being local. 
It is same thing.

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791807#comment-17791807
 ] 

Stefan Miklosovic edited comment on CASSANDRA-19120 at 11/30/23 10:12 PM:
--

Anyway, that does not matter. Whole post can be read with pending being local. 
It is same thing.

It is the combination of this adding remote replicas

{code}
for (Replica replica : filter(live.all(), r -> 
!contacts.contains(r)))
{
contacts.add(replica);
if (--add == 0)
break;
}
{code}

and this

{code}
if (!repairs.containsKey(participant) && 
shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
{code}

not decementing when participant is remote.


was (Author: smiklosovic):
Anyway, that does not matter. Whole post can be read with pending being local. 
It is same thing.

It it the combination of this adding remote replicas

{code}
for (Replica replica : filter(live.all(), r -> 
!contacts.contains(r)))
{
contacts.add(replica);
if (--add == 0)
break;
}
{code}

and this

{code}
if (!repairs.containsKey(participant) && 
shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
{code}

not decementing when participant is remote.

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: comm

Re: [PR] Cassandra 18852: Make bulk writer resilient to cluster resize events [cassandra-analytics]

2023-11-30 Thread via GitHub


arjunashok commented on code in PR #17:
URL: 
https://github.com/apache/cassandra-analytics/pull/17#discussion_r1411327830


##
cassandra-analytics-core/src/main/java/org/apache/cassandra/spark/bulkwriter/RecordWriter.java:
##
@@ -110,20 +132,47 @@ public StreamResult write(Iterator> sourceIterato
 Map valueMap = new HashMap<>();
 try
 {
+List exclusions = 
failureHandler.getFailedInstances();
+Set> newRanges = 
initialTokenRangeMapping.getRangeMap().asMapOfRanges().entrySet()
+   
.stream()
+   
.filter(e -> !exclusions.contains(e.getValue()))
+   
.map(Map.Entry::getKey)
+   
.collect(Collectors.toSet());
+
 while (dataIterator.hasNext())
 {
+Tuple2 rowData = dataIterator.next();
+streamSession = maybeCreateStreamSession(taskContext, 
streamSession, rowData, newRanges, failureHandler);
+
+sessions.add(streamSession);

Review Comment:
   This was done to separate the session closures instead of having it 
scattered with checks for non-existent sessions. 
   
   However, on looking further it seems like we do not really need to eagerly 
create the session for the partition's token range. Will update for it to be 
created lazily.
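
For reference, a generic sketch of the lazy-creation pattern being described (the `LazySession` name and shape are made up for illustration; the actual change would live around `maybeCreateStreamSession` in `RecordWriter`):

{code:java}
import java.util.function.Supplier;

// Generic lazy-initialization sketch (not the cassandra-analytics API): the
// session for a token range is only created when the first row that falls into
// that range is actually written, instead of eagerly up front.
final class LazySession<T>
{
    private final Supplier<T> factory;
    private T session;                 // null until first use

    LazySession(Supplier<T> factory) { this.factory = factory; }

    T get()
    {
        if (session == null)
            session = factory.get();   // created on demand, on the writer thread
        return session;
    }

    boolean isCreated() { return session != null; }
}
{code}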



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791807#comment-17791807
 ] 

Stefan Miklosovic commented on CASSANDRA-19120:
---

Anyway, that does not matter. Whole post can be read with pending being local. 
It is same thing.

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791806#comment-17791806
 ] 

Stefan Miklosovic commented on CASSANDRA-19120:
---

Then I don't understand how [~chovatia.jayd...@gmail.com] concluded that the 
initial adjustedBlockFor is 2. This
{code}
blockFor += pending.count(InOurDc.replicas());
{code}

will then be 3.
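
For illustration only, the same arithmetic as a tiny standalone sketch (not 
Cassandra code; names are made up), assuming the single pending replica is the 
node joining the local DC:
{code:java}
// Toy arithmetic only, not Cassandra code.
public class BlockForArithmeticSketch
{
    // localQuorum(3) == 2
    static int localQuorum(int rf) { return rf / 2 + 1; }

    public static void main(String[] args)
    {
        int rf = 3;
        int pendingInOurDc = 1; // the node being added to the local DC
        // mirrors "blockFor += pending.count(InOurDc.replicas())"
        int blockFor = localQuorum(rf) + pendingInOurDc;
        System.out.println(blockFor); // prints 3
    }
}
{code}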

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Runtian Liu (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791805#comment-17791805
 ] 

Runtian Liu edited comment on CASSANDRA-19120 at 11/30/23 10:05 PM:


{quote}and since we are assuming that pending is not in local DC, it will add 0 
to 2 which is 2 so yeah, adjustedBlockFor will be 2. We are on the same page 
here.
{quote}
The pending replica is in the local DC, because we are adding a node in the 
local DC. So the adjustedBlockFor is 3, not 2.


was (Author: JIRAUSER291682):
{quote}

and since we are assuming that pending is not in local DC, it will add 0 to 2 
which is 2 so yeah, adjustedBlockFor will be 2. We are on the same page here.

{quote}

The pending is in local DC. Because we are adding a node in local DC.

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Runtian Liu (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791805#comment-17791805
 ] 

Runtian Liu commented on CASSANDRA-19120:
-

{quote}

and since we are assuming that pending is not in local DC, it will add 0 to 2 
which is 2 so yeah, adjustedBlockFor will be 2. We are on the same page here.

{quote}

The pending replica is in the local DC, because we are adding a node in the local DC.

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791801#comment-17791801
 ] 

Stefan Miklosovic edited comment on CASSANDRA-19120 at 11/30/23 9:54 PM:
-

If we have a keyspace with RF 3 and we do a read with CL LOCAL_QUORUM, then 
this (looking into trunk version of that)
{code:java}
int adjustedBlockFor = repairPlan.writeQuorum();
{code}
will be set here (1). writeQuorum is computed in the constructor above, and 
that is computed in (2). It will firstly compute blockFor from the consistency 
level for LOCAL_QUORUM when RF is 3 which will be 2 (line 176) and since we 
have LOCAL_QUORUM it will go to the case where it does
{code:java}
blockFor += pending.count(InOurDc.replicas());
{code}
and since we are assuming that pending is not in local DC, it will add 0 to 2 
which is 2 so yeah, adjustedBlockFor will be 2. We are on the same page here.

Now this:
{code:java}
int adjustedBlockFor = repairPlan.writeQuorum();
for (Replica participant : repairPlan.contacts())
{
if (!repairs.containsKey(participant) && 
shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
}
this.blockFor = adjustedBlockFor;
{code}
This is a little bit confusing to go through, but in order to decrement 
adjustedBlockFor (which is 2 at the beginning of this loop), a participant of 
the repairPlan (contacts) has to be in the local DC (so that shouldBlockOn, 
which is InOurDc.endpoints(), returns true for it).

repairPlan.contacts() in that loop will ultimately take these contacts from 
(3). In that selector, (4)
{code:java}
int add = consistencyLevel.blockForWrite(liveAndDown.replicationStrategy(), 
liveAndDown.pending()) - contacts.size();
{code}
blockForWrite will be called with a strategy with RF = 3, LOCAL_QUORUM gives 2, 
pending might be one, so it will again do "blockFor += 
pending.count(InOurDc.replicas());", which will be 2 += 0 = 2. In order for 
add to be at least 1 when blockForWrite returns 2, contacts.size() has to be 1 
(2 - 1 = 1). Then it goes through the live replicas and it will choose one 
which was not contacted {*}and here it might pick a remote replica{*}.

so if we return to that "if"
{code:java}
if (!repairs.containsKey(participant) && 
shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
{code}
the participant can be remote, so "repairs.containsKey(participant)" is "false" 
(when repairs are all local), hence "!repairs.containsKey(participant)" is true. 
Then the second clause, shouldBlockOn, will evaluate to false, because the 
participant is not local and shouldBlockOn is InOurDc.endpoints. So true && 
false = false, and "adjustedBlockFor" will not be decremented. But I think that 
it should be, because we are not going to block for remote participants. If it 
is not decremented, then blockFor will be set to 2 but it should be set only to 
1. If it is set to 2, then this
{code:java}
void ack(InetAddressAndPort from)
{
if (shouldBlockOn.test(from))
{
pendingRepairs.remove(repairPlan.lookup(from));
latch.decrement();
}
}
{code}
will decrement it just once, from 2 to 1 and it will never reach 0.

So it seems to me that we should do this:
{code:java}
if (!repairs.containsKey(participant) && 
!shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
{code}
Notice the negation for shouldBlockOn. We should not block on remotes. 
shouldBlockOn will return true when the participant is local. In order not to 
block on remote participants, (!shouldBlockOn) should be true, and that is only 
when shouldBlockOn is false, i.e. only when the participant is a remote one.

Or maybe I am completely wrong? :) Would you guys be so nice as to go through 
this and proof-read it?

(1) 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlan.java#L326-L329]
(2) 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L181-L184]
(3) 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L577]
(4) 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L572]
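
For what it's worth, here is a self-contained toy model (plain Java; the names 
are illustrative, not Cassandra's classes) of the bookkeeping above, under the 
same assumptions as this comment: writeQuorum is 2, all repairs go to the local 
replica L1, and the extra contact picked by the selector is the remote replica 
R1:
{code:java}
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Toy model only; names are illustrative, not Cassandra's.
public class AdjustedBlockForSketch
{
    public static void main(String[] args)
    {
        Predicate<String> shouldBlockOn = Set.of("L1", "L2", "L3")::contains; // local DC
        Map<String, String> repairs = Map.of("L1", "mutation"); // repairs are all local
        Set<String> contacts = Set.of("L1", "R1");              // R1 = remote extra contact
        int writeQuorum = 2;

        // Current condition: only contacts that are local AND need no repair lower blockFor.
        int current = writeQuorum;
        for (String p : contacts)
            if (!repairs.containsKey(p) && shouldBlockOn.test(p))
                current--;
        System.out.println("current  blockFor = " + current);  // 2, but only L1 can ever ack

        // Negated condition discussed above: contacts we will never block on lower blockFor.
        int proposed = writeQuorum;
        for (String p : contacts)
            if (!repairs.containsKey(p) && !shouldBlockOn.test(p))
                proposed--;
        System.out.println("proposed blockFor = " + proposed); // 1, released by L1's ack
    }
}
{code}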


was (Author: smiklosovic):
If we have a keyspace with RF 3 and we do a read with CL LOCAL_QUORUM, then 
this (looking into trunk version of that)
{code:java}
int adjustedBlockFor = repairPlan.writeQuorum();
{code}
will be set here (1). writeQuorum is computed in the constructor above, and 
that is computed in (2). It will firstly compute blockFor from the consistency 
level for LOCAL_QUORUM when RF is 3 which will be 2 (line 176) and since we 
have LOCAL_QUORUM it will go to the case where it does
{code:java}
blockFor += pending.count(InOurDc.repli

[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791801#comment-17791801
 ] 

Stefan Miklosovic edited comment on CASSANDRA-19120 at 11/30/23 9:48 PM:
-

If we have a keyspace with RF 3 and we do a read with CL LOCAL_QUORUM, then 
this (looking into trunk version of that)
{code:java}
int adjustedBlockFor = repairPlan.writeQuorum();
{code}
will be set here (1). writeQuorum is computed in the constructor above, and 
that is computed in (2). It will firstly compute blockFor from the consistency 
level for LOCAL_QUORUM when RF is 3 which will be 2 (line 176) and since we 
have LOCAL_QUORUM it will go to the case where it does
{code:java}
blockFor += pending.count(InOurDc.replicas());
{code}
and since we are assuming that pending is not in local DC, it will add 0 to 2 
which is 2 so yeah, adjustedBlockFor will be 2. We are on the same page here.

Now this:
{code:java}
int adjustedBlockFor = repairPlan.writeQuorum();
for (Replica participant : repairPlan.contacts())
{
if (!repairs.containsKey(participant) && 
shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
}
this.blockFor = adjustedBlockFor;
{code}
This is a little bit confusing to go through but in order to decrement 
adjustedBlockFor (which is 2 at the beginning of this loop), any participant of 
the repairPlan (contacts) has to be in the local dc (in order to have 
shouldBlockOn returning true when it is InOurDc.endpoints()).

repairPlan.contacts() in that loop will ultimately take these contacts from 
(3). In that selector, (4)
{code:java}
int add = consistencyLevel.blockForWrite(liveAndDown.replicationStrategy(), 
liveAndDown.pending()) - contacts.size();
{code}
blockForWrite will be called with strategy with RF = 3, LOCAL_QUORUM is 2, 
pending might be one, so it will do again "blockFor += 
pending.count(InOurDc.replicas());" which will be 2 += 0 = 2. In order to have 
add at least 1, when blockForWrite returns 2, contacts.size() has to be 1 (2 - 
1 = 1). Then it goes through live replicas and it will choose some which was 
not contacted {*}and here it might pick remote replica{*}.

so if we return to that "if"
{code:java}
if (!repairs.containsKey(participant) && 
shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
{code}
participant can be remote, so "repairs.containsKey(participant)" is "false" 
(when repairs are all local), so "!repairs.containsKey(participant)" is true. 
Then the second clause, shouldBlock, will evaluate to false, because 
participant is not local and shouldBlockOn is InOurDc.endpoints. So true && 
false = false. So "adjustedBlockFor" will not be decremented. But I think that 
it should be, because we are not going to block for remote participants. If it 
is not decremented, then blockFor will be set to 2 but it should be set only to 
1. If it is set to 2, then this
{code:java}
void ack(InetAddressAndPort from)
{
if (shouldBlockOn.test(from))
{
pendingRepairs.remove(repairPlan.lookup(from));
latch.decrement();
}
}
{code}
will decrement it just once, from 2 to 1, and it will never reach 0.

So it seems to me that we should do this:
{code:java}
if (!repairs.containsKey(participant) && 
!shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
{code}
Notice the negation for shouldBlockOn. We should not block on remotes. 
shouldBlockOn will return true when participant is local. In order to not block 
on remote participants, (!shouldNotBlockOn) should be true, that is only when 
shouldNotBlockOn is false. That is only when participant is remote one.

Or maybe I am completely wrong? :) Would you guys be so nice to go through this 
and proof read it?

(1) 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlan.java#L326-L329]
(2) 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L181-L184]
(3) 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L577]
(4) 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L572]


was (Author: smiklosovic):
If we have a keyspace with RF 3 and we do a read with CL LOCAL_QUORUM, then 
this (looking into trunk version of that)
{code}
int adjustedBlockFor = repairPlan.writeQuorum();
{code}

will be set here (1). writeQuorum is computed in the constructor above, and 
that is computed in (2). It will firstly compute blockFor from the consistency 
level for LOCAL_QUORUM when RF is 3 which will be 2 (line 176) and since we 
have LOCAL_QUORUM it will go to the case where it does 

{code}
blockFor += pending.count(InOurDc.replicas());

[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791801#comment-17791801
 ] 

Stefan Miklosovic commented on CASSANDRA-19120:
---

If we have a keyspace with RF 3 and we do a read with CL LOCAL_QUORUM, then 
this (looking into trunk version of that)
{code}
int adjustedBlockFor = repairPlan.writeQuorum();
{code}

will be set here (1). writeQuorum is computed in the constructor above, and 
that is computed in (2). It will firstly compute blockFor from the consistency 
level for LOCAL_QUORUM when RF is 3 which will be 2 (line 176) and since we 
have LOCAL_QUORUM it will go to the case where it does 

{code}
blockFor += pending.count(InOurDc.replicas());
{code}

and since we are assuming that pending is not in local DC, it will add 0 to 2 
which is 2 so yeah, adjustedBlockFor will be 2. We are on the same page here.

Now this:

{code}
int adjustedBlockFor = repairPlan.writeQuorum();
for (Replica participant : repairPlan.contacts())
{
if (!repairs.containsKey(participant) && 
shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
}
this.blockFor = adjustedBlockFor;
{code}

This is a little bit confusing to go through but in order to decrement 
adjustedBlockFor (which is 2 at the beginning of this loop), any participant of 
the repairPlan (contacts) has to be in the local dc (in order to have 
shouldBlockOn returning true when it is InOurDc.endpoints()). 

repairPlan.contacts() in that loop will ultimately take these contacts from 
(3). In that selector, (4)

{code}
int add = consistencyLevel.blockForWrite(liveAndDown.replicationStrategy(), 
liveAndDown.pending()) - contacts.size();
{code}

blockForWrite will be called with strategy with RF = 3, LOCAL_QUORUM is 2, 
pending might be one, so it will do again "blockFor += 
pending.count(InOurDc.replicas());" which will be 2 += 0 = 2. In order to have 
add at least 1, when blockForWrite returns 2, contacts.size() has to be 1 (2 - 
1 = 1). Then it goes through live replicas and it will choose some which was 
not contacted *and here it might pick remote replica*.

so if we return to that "if"

{code}
if (!repairs.containsKey(participant) && 
shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
{code}

participant can be remote, so "repairs.containsKey(participant)" is "false" 
(when repairs are all local), so "!repairs.containsKey(participant)" is true. 
Then the second clause, shouldBlock, will evaluate to false, because 
participant is not local and shouldBlockOn is InOurDc.endpoints. So true && 
false = false. So "adjustedBlockFor" will not be decremented. But I think that 
it should be, because we are not going to block for remote participants. If it 
is not decremented, then blockFor will be set to 2 but it should be set only to 
1. If it is set to 2, then this 

{code}
void ack(InetAddressAndPort from)
{
if (shouldBlockOn.test(from))
{
pendingRepairs.remove(repairPlan.lookup(from));
latch.decrement();
}
}
{code}

will decrement it just once, from 2 to 1, and it will never reach 0.

So it seems to me that we should do this:

{code}
if (!repairs.containsKey(participant) && 
!shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
{code}

Notice the negation for shouldBlockOn. We should not block on remotes. 
shouldBlockOn will return true when participant is local. In order to not block 
on remote participants, (!shouldNotBlockOn) should be true, that is only when 
shouldNotBlockOn is false. That is only when participant is remote one.

Or maybe I am completely wrong? :) Would you guys be so nice as to go through this 
and proof-read it?
 
(1) 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlan.java#L326-L329
(2) 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L181-L184
(3) 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L577
(4)

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2

[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791801#comment-17791801
 ] 

Stefan Miklosovic edited comment on CASSANDRA-19120 at 11/30/23 9:37 PM:
-

If we have a keyspace with RF 3 and we do a read with CL LOCAL_QUORUM, then 
this (looking into trunk version of that)
{code}
int adjustedBlockFor = repairPlan.writeQuorum();
{code}

will be set here (1). writeQuorum is computed in the constructor above, and 
that is computed in (2). It will firstly compute blockFor from the consistency 
level for LOCAL_QUORUM when RF is 3 which will be 2 (line 176) and since we 
have LOCAL_QUORUM it will go to the case where it does 

{code}
blockFor += pending.count(InOurDc.replicas());
{code}

and since we are assuming that pending is not in local DC, it will add 0 to 2 
which is 2 so yeah, adjustedBlockFor will be 2. We are on the same page here.

Now this:

{code}
int adjustedBlockFor = repairPlan.writeQuorum();
for (Replica participant : repairPlan.contacts())
{
if (!repairs.containsKey(participant) && 
shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
}
this.blockFor = adjustedBlockFor;
{code}

This is a little bit confusing to go through but in order to decrement 
adjustedBlockFor (which is 2 at the beginning of this loop), any participant of 
the repairPlan (contacts) has to be in the local dc (in order to have 
shouldBlockOn returning true when it is InOurDc.endpoints()). 

repairPlan.contacts() in that loop will ultimately take these contacts from 
(3). In that selector, (4)

{code}
int add = consistencyLevel.blockForWrite(liveAndDown.replicationStrategy(), 
liveAndDown.pending()) - contacts.size();
{code}

blockForWrite will be called with strategy with RF = 3, LOCAL_QUORUM is 2, 
pending might be one, so it will do again "blockFor += 
pending.count(InOurDc.replicas());" which will be 2 += 0 = 2. In order to have 
add at least 1, when blockForWrite returns 2, contacts.size() has to be 1 (2 - 
1 = 1). Then it goes through live replicas and it will choose some which was 
not contacted *and here it might pick remote replica*.

so if we return to that "if"

{code}
if (!repairs.containsKey(participant) && 
shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
{code}

participant can be remote, so "repairs.containsKey(participant)" is "false" 
(when repairs are all local), so "!repairs.containsKey(participant)" is true. 
Then the second clause, shouldBlock, will evaluate to false, because 
participant is not local and shouldBlockOn is InOurDc.endpoints. So true && 
false = false. So "adjustedBlockFor" will not be decremented. But I think that 
it should be, because we are not going to block for remote participants. If it 
is not decremented, then blockFor will be set to 2 but it should be set only to 
1. If it is set to 2, then this 

{code}
void ack(InetAddressAndPort from)
{
if (shouldBlockOn.test(from))
{
pendingRepairs.remove(repairPlan.lookup(from));
latch.decrement();
}
}
{code}

will decrement it just once, from 2 to 1, and it will never reach 0.

So it seems to me that we should do this:

{code}
if (!repairs.containsKey(participant) && 
!shouldBlockOn.test(participant.endpoint()))
adjustedBlockFor--;
{code}

Notice the negation for shouldBlockOn. We should not block on remotes. 
shouldBlockOn will return true when participant is local. In order to not block 
on remote participants, (!shouldNotBlockOn) should be true, that is only when 
shouldNotBlockOn is false. That is only when participant is remote one.

Or maybe I am completely wrong? :) Would you guys be so nice as to go through this 
and proof-read it?
 
(1) 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlan.java#L326-L329
(2) 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L181-L184
(3) 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L577
(4) 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L572


was (Author: smiklosovic):
If we have a keyspace with RF 3 and we do a read with CL LOCAL_QUORUM, then 
this (looking into trunk version of that)
{code}
int adjustedBlockFor = repairPlan.writeQuorum();
{code}

will be set here (1). writeQuorum is computed in the constructor above, and 
that is computed in (2). It will firstly compute blockFor from the consistency 
level for LOCAL_QUORUM when RF is 3 which will be 2 (line 176) and since we 
have LOCAL_QUORUM it will go to the case where it does 

{code}
blockFor += pending.count(InOurDc.replicas());
{code}

and since we are assu

[jira] [Commented] (CASSANDRA-19011) Primary key -> row ID lookups are broken for skipping and intersections during SAI queries

2023-11-30 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791795#comment-17791795
 ] 

Michael Semb Wever commented on CASSANDRA-19011:


+1 (with one non-blocking question)

In the latest CI runs there are no failures that weren't mentioned in my 
previous comment above.


> Primary key -> row ID lookups are broken for skipping and intersections 
> during SAI queries
> --
>
> Key: CASSANDRA-19011
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19011
> Project: Cassandra
>  Issue Type: Bug
>  Components: Feature/SAI
>Reporter: Alex Petrov
>Assignee: Mike Adamson
>Priority: Urgent
> Fix For: 5.0-beta
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> Schema:
> {code:java}
> CREATE TABLE IF NOT EXISTS distributed_test_keyspace.tbl1 (pk1 bigint,ck1 
> bigint,v1 ascii,v2 bigint, PRIMARY KEY (pk1, ck1)) WITH  CLUSTERING ORDER BY 
> (ck1 ASC);
> CREATE CUSTOM INDEX v1_sai_idx ON distributed_test_keyspace.tbl1 (v1) USING 
> 'StorageAttachedIndex' WITH OPTIONS = {'case_sensitive': 'false', 
> 'normalize': 'true', 'ascii': 'true'}; ;
> CREATE CUSTOM INDEX v2_sai_idx ON distributed_test_keyspace.tbl1 (v2) USING 
> 'StorageAttachedIndex';
>  {code}
> {code:java}
> java.lang.AssertionError: skipped to an item smaller than the target; 
> iterator: 
> org.apache.cassandra.index.sai.disk.IndexSearchResultIterator@f399f79, target 
> key: PrimaryKey: { token: 8384965201802291970, partition: 
> DecoratedKey(8384965201802291970, c4bc1c50f9e76a50), clustering: 
> CLUSTERING:8b4b4c5991a4ea10 } , returned key: PrimaryKey: { token: 
> 8384965201802291970, partition: DecoratedKey(8384965201802291970, 
> c4bc1c50f9e76a50), clustering: CLUSTERING:89f1cf92658cb668 } 
>   at 
> org.apache.cassandra.index.sai.iterators.KeyRangeIntersectionIterator.computeNext(KeyRangeIntersectionIterator.java:95)
>   at 
> org.apache.cassandra.index.sai.iterators.KeyRangeIntersectionIterator.computeNext(KeyRangeIntersectionIterator.java:39)
>   at 
> org.apache.cassandra.utils.AbstractGuavaIterator.tryToComputeNext(AbstractGuavaIterator.java:122)
>   at 
> org.apache.cassandra.index.sai.iterators.KeyRangeIterator.tryToComputeNext(KeyRangeIterator.java:129)
>   at 
> org.apache.cassandra.utils.AbstractGuavaIterator.hasNext(AbstractGuavaIterator.java:116)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.nextKey(StorageAttachedIndexSearcher.java:274)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.nextKeyInRange(StorageAttachedIndexSearcher.java:203)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.nextSelectedKeyInRange(StorageAttachedIndexSearcher.java:234)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.nextRowIterator(StorageAttachedIndexSearcher.java:188)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.computeNext(StorageAttachedIndexSearcher.java:169)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.computeNext(StorageAttachedIndexSearcher.java:111)
>   at 
> org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
>   at 
> org.apache.cassandra.db.transform.BasePartitions.hasNext(BasePartitions.java:91)
>   at 
> org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:338)
>   at 
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:201)
>   at 
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.(ReadResponse.java:186)
>   at 
> org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:48)
>   at 
> org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:346)
>   at 
> org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:2186)
>   at 
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2581)
>   at 
> org.apache.cassandra.concurrent.ExecutionFailure$2.run(ExecutionFailure.java:163)
>   at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:143)
>   at 
> relocated.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.base/java.lang.Thread.run(Thread.java:829) {code}
>  
> Unfortunately, there's no tooling for shrinking around SAI just yet, but I 
> have a programmatic repro using INSERT and DELETE statements. I will do my 
> best to post it asap, but thought this can already be useful for 

[jira] [Updated] (CASSANDRA-19127) Test Failure: org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest

2023-11-30 Thread Ekaterina Dimitrova (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ekaterina Dimitrova updated CASSANDRA-19127:

Resolution: Duplicate
Status: Resolved  (was: Open)

> Test Failure: 
> org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest
> -
>
> Key: CASSANDRA-19127
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19127
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 5.1-beta
>
>
> The test is not flaky on 5.0 - 
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2590/workflows/47dedf52-87fd-4178-bc89-d179e58b6562
> But it is significantly flaky on trunk - 
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2591/workflows/1150aab2-4961-4fe3-a126-b96356fdb939/jobs/49867/tests
> {code:java}
> org.apache.cassandra.simulator.SimulationException: Failed on seed 
> 0xf2b8eff98afd45dd
>   Suppressed: java.lang.RuntimeException: 
> java.util.concurrent.TimeoutException
>   at 
> org.apache.cassandra.utils.Throwables.maybeFail(Throwables.java:79)
>   at 
> org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:537)
>   at 
> org.apache.cassandra.distributed.impl.AbstractCluster.close(AbstractCluster.java:1098)
>   at 
> org.apache.cassandra.simulator.ClusterSimulation.close(ClusterSimulation.java:854)
>   at 
> org.apache.cassandra.simulator.SimulationRunner$Run.run(SimulationRunner.java:361)
>   at 
> org.apache.cassandra.simulator.SimulationRunner$BasicCommand.run(SimulationRunner.java:346)
>   at 
> org.apache.cassandra.simulator.paxos.PaxosSimulationRunner$Run.run(PaxosSimulationRunner.java:34)
>   at 
> org.apache.cassandra.simulator.paxos.PaxosSimulationRunner.main(PaxosSimulationRunner.java:148)
>   at 
> org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest(ShortPaxosSimulationTest.java:101)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   Caused by: java.util.concurrent.TimeoutException
>   at 
> org.apache.cassandra.utils.concurrent.AbstractFuture.get(AbstractFuture.java:253)
>   at 
> org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:529)
>   Suppressed: java.util.concurrent.TimeoutException
>   Suppressed: java.util.concurrent.TimeoutException
>   Suppressed: java.util.concurrent.TimeoutException
>   Suppressed: java.util.concurrent.TimeoutException
>   Suppressed: java.util.concurrent.TimeoutException
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.cassandra.simulator.cluster.KeyspaceActions.scheduleAndUpdateTopologyOnCompletion(KeyspaceActions.java:352)
>   at 
> org.apache.cassandra.simulator.cluster.KeyspaceActions.next(KeyspaceActions.java:291)
>   at org.apache.cassandra.simulator.Actions.next(Actions.java:147)
>   at 
> org.apache.cassandra.simulator.Actions.lambda$streamNextSupplier$3(Actions.java:156)
>   at 
> org.apache.cassandra.simulator.Actions$LambdaAction.performSimple(Actions.java:63)
>   at 
> org.apache.cassandra.simulator.Action.performAndRegister(Action.java:468)
>   at org.apache.cassandra.simulator.Action.perform(Action.java:486)
>   at 
> org.apache.cassandra.simulator.ActionSchedule.next(ActionSchedule.java:379)
>   at 
> org.apache.cassandra.simulator.paxos.PaxosSimulation$2.next(PaxosSimulation.java:217)
>   at 
> org.apache.cassandra.simulator.paxos.PaxosSimulation.run(PaxosSimulation.java:189)
>   at 
> org.apache.cassandra.simulator.paxos.PairOfSequencesPaxosSimulation.run(PairOfSequencesPaxosSimulation.java:351)
>   at 
> org.apache.cassandra.simulator.SimulationRunner$Run.run(SimulationRunner.java:365)
>   at 
> org.apache.cassandra.simulator.SimulationRunner$BasicCommand.run(SimulationRunner.java:346)
>   at 
> org.apache.cassandra.simulator.paxos.PaxosSimulationRunner$Run.run(PaxosSimulationRunner.java:34)
>   at 
> org.apache.cassandra.simulator.paxos.PaxosSimulationRunner.main(PaxosSimulationRunner.java:148)
>   at 
> org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest(ShortPaxosSimulationTest.java:101)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.i

[jira] [Commented] (CASSANDRA-19058) Test Failure: org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest-_jdk11

2023-11-30 Thread Ekaterina Dimitrova (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791777#comment-17791777
 ] 

Ekaterina Dimitrova commented on CASSANDRA-19058:
-

Thanks, [~samt]; I appreciate the quick response! Closing CASSANDRA-19127 in 
favor of this one

> Test Failure: 
> org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest-_jdk11
> 
>
> Key: CASSANDRA-19058
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19058
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: Sam Tunnicliffe
>Assignee: Sam Tunnicliffe
>Priority: Normal
> Fix For: 5.1-alpha1
>
>
> butler shows this as failing on J17 but here we see it fail on J11 
> [https://app.circleci.com/pipelines/github/michaelsembwever/cassandra/256/workflows/c4fda8f1-a8d6-4523-be83-5e30b9de39fe/jobs/20463/tests]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19058) Test Failure: org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest-_jdk11

2023-11-30 Thread Sam Tunnicliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791770#comment-17791770
 ] 

Sam Tunnicliffe commented on CASSANDRA-19058:
-

bq. Should I reopen this ticket?

No problem, I've just done it. We are aware that we have to take another look 
at the Simulator tests so we will investigate. Thanks!

> Test Failure: 
> org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest-_jdk11
> 
>
> Key: CASSANDRA-19058
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19058
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: Sam Tunnicliffe
>Assignee: Sam Tunnicliffe
>Priority: Normal
> Fix For: 5.1-alpha1
>
>
> butler shows this as failing on J17 but here we see it fail on J11 
> [https://app.circleci.com/pipelines/github/michaelsembwever/cassandra/256/workflows/c4fda8f1-a8d6-4523-be83-5e30b9de39fe/jobs/20463/tests]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19058) Test Failure: org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest-_jdk11

2023-11-30 Thread Sam Tunnicliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Tunnicliffe updated CASSANDRA-19058:

Resolution: (was: Duplicate)
Status: Open  (was: Resolved)

> Test Failure: 
> org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest-_jdk11
> 
>
> Key: CASSANDRA-19058
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19058
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: Sam Tunnicliffe
>Assignee: Sam Tunnicliffe
>Priority: Normal
> Fix For: 5.1-alpha1
>
>
> butler shows this as failing on J17 but here we see it fail on J11 
> [https://app.circleci.com/pipelines/github/michaelsembwever/cassandra/256/workflows/c4fda8f1-a8d6-4523-be83-5e30b9de39fe/jobs/20463/tests]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19066) Test Failure: org.apache.cassandra.distributed.upgrade.MixedModeFrom3LoggedBatchTest.testSimpleStrategy-_jdk11

2023-11-30 Thread Sam Tunnicliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791764#comment-17791764
 ] 

Sam Tunnicliffe commented on CASSANDRA-19066:
-

Added a minor comment on the PR, along with a suggestion that we need to follow 
up on downgradability and possibly revisit `storage_compatibility_mode`.

> Test Failure: 
> org.apache.cassandra.distributed.upgrade.MixedModeFrom3LoggedBatchTest.testSimpleStrategy-_jdk11
> --
>
> Key: CASSANDRA-19066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19066
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest/java
>Reporter: Sam Tunnicliffe
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci-for-19119-19066-19076.html, 
> ci-for-19119-19066-19076.tar.gz
>
>
> Failed in Circle:
> [https://app.circleci.com/pipelines/github/michaelsembwever/cassandra/256/workflows/c4fda8f1-a8d6-4523-be83-5e30b9de39fe/jobs/20534/tests]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19117) Harry: Remove notion of Modification

2023-11-30 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-19117:

Reviewers: Abe Ratnofsky, Caleb Rackliffe  (was: Abe Ratnofsky)

> Harry: Remove notion of Modification
> 
>
> Key: CASSANDRA-19117
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19117
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/fuzz
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: High
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19116) History Builder API 2.0

2023-11-30 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-19116:

Reviewers: Abe Ratnofsky, Caleb Rackliffe  (was: Abe Ratnofsky)

> History Builder API 2.0
> ---
>
> Key: CASSANDRA-19116
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19116
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Test/fuzz
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Urgent
>
> Harry history Builder 2.0
>   * New History Builder API
>   * Add an ability to track LTS visited by partition in visited_lts 
> static column
>   * Add a model checker that checks against a different Cluster instance 
> (for example, flush vs no flush, local vs nonlocal, etc)
>   * Add an ability to issue LTSs out-of-order



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Jaydeepkumar Chovatia (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791756#comment-17791756
 ] 

Jaydeepkumar Chovatia commented on CASSANDRA-19120:
---

Sure.

Say we are reading with LOCAL_QUORUM with a replication factor of 3. Now, let's 
understand the input parameters for the following: 

 
{code:java}
public BlockingPartitionRepair(DecoratedKey key, Map 
repairs, ReplicaPlan.ForWrite repairPlan, Predicate 
shouldBlockOn) {code}
 * {+}_repairPlan_{+}: This could have one local node _L1_ and one remote node 
_R1_, as the _Selector_ [1] could select the remote node, as [~curlylrt] 
mentioned above. So the _repairPlan_ parameter has {{_}L1, R1{_}}
 * {+}_repairs_{+}: This also has _L1_ and _R1_, because these two might be 
stale and require an update.
 * Now, at the following line, _pendingRepairs_ will have two nodes {{_}L1, 
R1{_}}.  
{code:java}
this.pendingRepairs = new ConcurrentHashMap<>(repairs); {code}

 *  so _adjustedBlockFor_ will be set to 2 at the following line:
{code:java}
int adjustedBlockFor = repairPlan.writeQuorum(); {code}

 * This if condition will always be _false_, even for _R1_, because _repairs_ 
includes {_}R1{_}; as a result, _adjustedBlockFor_ remains unchanged, i.e., _2_
{code:java}
if (!repairs.containsKey(participant) && 
shouldBlockOn.test(participant.endpoint())) {code}

 * The following lines set _blockFor_ and _latch_ to the value 2 
{code:java}
this.blockFor = adjustedBlockFor; 
...
latch = newCountDownLatch(Math.max(blockFor, 0));{code}

 * But when we receive a response from {_}R1{_}, line 122 [2] 
{color:#de350b}excludes{color} the response; as a result, the latch does not 
decrement 
{code:java}
void ack(InetAddressAndPort from)
    {
        if (shouldBlockOn.test(from))
        {
            pendingRepairs.remove(repairPlan.lookup(from));
            latch.decrement();
        }
    } {code}

 

[1] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L411]

[2] 
[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L122C9-L122C9]
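
To make the steps above concrete, here is a standalone toy run-through (plain 
Java; names are illustrative, not Cassandra's classes) of why the latch ends up 
waiting until the timeout when the repair mutations go to L1 and R1:
{code:java}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Toy trace only; names are illustrative, not Cassandra's.
public class LatchTimeoutSketch
{
    public static void main(String[] args)
    {
        Predicate<String> shouldBlockOn = Set.of("L1", "L2", "L3")::contains; // local DC only
        Map<String, String> repairs = Map.of("L1", "m1", "R1", "m2");         // repairs to L1 and R1
        int writeQuorum = 2;

        // The if condition is false for both L1 and R1 (both are in `repairs`),
        // so adjustedBlockFor stays at 2.
        int blockFor = writeQuorum;
        for (String participant : repairs.keySet())
            if (!repairs.containsKey(participant) && shouldBlockOn.test(participant))
                blockFor--;

        AtomicInteger latch = new AtomicInteger(Math.max(blockFor, 0)); // latch = 2

        // Both replicas respond, but the ack only counts down for local endpoints.
        for (String from : Set.of("L1", "R1"))
            if (shouldBlockOn.test(from))
                latch.decrementAndGet();

        System.out.println("latch after all acks = " + latch.get()); // 1 -> read repair times out
    }
}
{code}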

 

 

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> For a two DCs cluster setup. When a new node is being added to DC1, for 
> blocking read repair triggered by local_quorum in DC1, it will require to 
> send read repair mutation to an extra node(1)(2). The selector for read 
> repair may select *ANY* node that has not been contacted before(3) instead of 
> selecting the DC1 nodes. If a node from DC2 is selected, this will cause 100% 
> timeout because of the bug described below:
> When we initialized the latch(4) for blocking read repair, the shouldBlockOn 
> function will only return true for local nodes(5), the blockFor value will be 
> reduced if a local node doesn't require repair(6). The blockFor is same as 
> the number of read repair mutation sent out. But when the coordinator node 
> receives the response from the target nodes, the latch only count down for 
> nodes in same DC(7). The latch will wait till timeout and the read request 
> will timeout.
> This can be reproduced if you have a constant load on a 3 + 3 cluster when 
> adding a node. If you have someway to trigger blocking read repair(maybe by 
> adding load using stress tool). If you use local_quorum consistency with a 
> constant read after write load in the same DC that you are adding node. You 
> will see read timeout issue from time to time because of the bug described 
> above
>  
> I think for read repair when selecting the extra node to do repair, we should 
> prefer local nodes than the nodes from other region. Also, we need to fix the 
> latch part so even if we send mutation to the nodes in other DC, we don't get 
> a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apach

[jira] [Commented] (CASSANDRA-19011) Primary key -> row ID lookups are broken for skipping and intersections during SAI queries

2023-11-30 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791750#comment-17791750
 ] 

Caleb Rackliffe commented on CASSANDRA-19011:
-

+1 on both PRs

> Primary key -> row ID lookups are broken for skipping and intersections 
> during SAI queries
> --
>
> Key: CASSANDRA-19011
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19011
> Project: Cassandra
>  Issue Type: Bug
>  Components: Feature/SAI
>Reporter: Alex Petrov
>Assignee: Mike Adamson
>Priority: Urgent
> Fix For: 5.0-beta
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Schema:
> {code:java}
> CREATE TABLE IF NOT EXISTS distributed_test_keyspace.tbl1 (pk1 bigint,ck1 
> bigint,v1 ascii,v2 bigint, PRIMARY KEY (pk1, ck1)) WITH  CLUSTERING ORDER BY 
> (ck1 ASC);
> CREATE CUSTOM INDEX v1_sai_idx ON distributed_test_keyspace.tbl1 (v1) USING 
> 'StorageAttachedIndex' WITH OPTIONS = {'case_sensitive': 'false', 
> 'normalize': 'true', 'ascii': 'true'}; ;
> CREATE CUSTOM INDEX v2_sai_idx ON distributed_test_keyspace.tbl1 (v2) USING 
> 'StorageAttachedIndex';
>  {code}
> {code:java}
> java.lang.AssertionError: skipped to an item smaller than the target; 
> iterator: 
> org.apache.cassandra.index.sai.disk.IndexSearchResultIterator@f399f79, target 
> key: PrimaryKey: { token: 8384965201802291970, partition: 
> DecoratedKey(8384965201802291970, c4bc1c50f9e76a50), clustering: 
> CLUSTERING:8b4b4c5991a4ea10 } , returned key: PrimaryKey: { token: 
> 8384965201802291970, partition: DecoratedKey(8384965201802291970, 
> c4bc1c50f9e76a50), clustering: CLUSTERING:89f1cf92658cb668 } 
>   at 
> org.apache.cassandra.index.sai.iterators.KeyRangeIntersectionIterator.computeNext(KeyRangeIntersectionIterator.java:95)
>   at 
> org.apache.cassandra.index.sai.iterators.KeyRangeIntersectionIterator.computeNext(KeyRangeIntersectionIterator.java:39)
>   at 
> org.apache.cassandra.utils.AbstractGuavaIterator.tryToComputeNext(AbstractGuavaIterator.java:122)
>   at 
> org.apache.cassandra.index.sai.iterators.KeyRangeIterator.tryToComputeNext(KeyRangeIterator.java:129)
>   at 
> org.apache.cassandra.utils.AbstractGuavaIterator.hasNext(AbstractGuavaIterator.java:116)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.nextKey(StorageAttachedIndexSearcher.java:274)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.nextKeyInRange(StorageAttachedIndexSearcher.java:203)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.nextSelectedKeyInRange(StorageAttachedIndexSearcher.java:234)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.nextRowIterator(StorageAttachedIndexSearcher.java:188)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.computeNext(StorageAttachedIndexSearcher.java:169)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.computeNext(StorageAttachedIndexSearcher.java:111)
>   at 
> org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
>   at 
> org.apache.cassandra.db.transform.BasePartitions.hasNext(BasePartitions.java:91)
>   at 
> org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:338)
>   at 
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:201)
>   at 
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.(ReadResponse.java:186)
>   at 
> org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:48)
>   at 
> org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:346)
>   at 
> org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:2186)
>   at 
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2581)
>   at 
> org.apache.cassandra.concurrent.ExecutionFailure$2.run(ExecutionFailure.java:163)
>   at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:143)
>   at 
> relocated.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.base/java.lang.Thread.run(Thread.java:829) {code}
>  
> Unfortunately, there's no tooling for shrinking around SAI just yet, but I 
> have a programmatic repro using INSERT and DELETE statements. I will do my 
> best to post it asap, but thought this can already be useful for visibility.
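
For readers parsing the assertion message: the invariant being violated is that, during an intersection, a per-index iterator asked to skip to a target primary key must report a key that compares greater than or equal to that target. Below is a minimal, hypothetical sketch of that contract (plain Java with illustrative names, not the actual SAI iterator classes), assuming keys can be modelled as sorted longs.

{code:java}
import java.util.NavigableSet;
import java.util.NoSuchElementException;
import java.util.TreeSet;

// Illustrative sketch only: the contract behind "skipped to an item smaller than
// the target". After skipTo(target), the key an iterator reports must be >= target.
public final class SkipToInvariantSketch
{
    static long checkedSkipTo(NavigableSet<Long> sortedKeys, long target)
    {
        Long returned = sortedKeys.ceiling(target); // first key >= target, or null if exhausted
        if (returned == null)
            throw new NoSuchElementException("iterator exhausted before reaching " + target);
        // The failure above corresponds to an iterator handing back a key that is
        // smaller than the target it was asked to skip to, which breaks the intersection.
        assert returned >= target : "skipped to an item smaller than the target";
        return returned;
    }

    public static void main(String[] args)
    {
        NavigableSet<Long> keys = new TreeSet<>();
        keys.add(10L);
        keys.add(20L);
        keys.add(30L);
        System.out.println(checkedSkipTo(keys, 15L)); // prints 20
    }
}
{code}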



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-

[jira] [Comment Edited] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791744#comment-17791744
 ] 

Stefan Miklosovic edited comment on CASSANDRA-19120 at 11/30/23 7:05 PM:
-

[~chovatia.jayd...@gmail.com] in your point 2), how did you come to the 
conclusion that blockFor (hence latch) will be set to 2? Could you elaborate 
point 2) in more detail?


was (Author: smiklosovic):
[~chovatia.jayd...@gmail.com] in your point 2), how you come to the conclusion 
that blockFor (hence latch) will be set to 2? Could you elaborate point 2) in 
more detail?

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> In a two-DC cluster, when a new node is being added to DC1, a blocking read 
> repair triggered by a local_quorum read in DC1 must send the read repair 
> mutation to one extra node(1)(2). The selector for read repair may pick *ANY* 
> node that has not been contacted before(3) instead of preferring DC1 nodes. 
> If a node from DC2 is selected, the read will time out 100% of the time 
> because of the bug described below:
> When the latch(4) for blocking read repair is initialized, the shouldBlockOn 
> function only returns true for local nodes(5), and the blockFor value is 
> reduced if a local node doesn't require repair(6). blockFor therefore equals 
> the number of read repair mutations sent out. But when the coordinator 
> receives responses from the target nodes, the latch is only counted down for 
> nodes in the same DC(7), so the latch waits until the read request times out.
> This can be reproduced with a constant load on a 3 + 3 cluster while adding a 
> node, given some way to trigger blocking read repair (for example by adding 
> load with the stress tool). With local_quorum consistency and a constant 
> read-after-write load in the DC where the node is being added, read timeouts 
> show up from time to time because of the bug described above.
>  
> I think that when read repair selects the extra node to repair, it should 
> prefer local nodes over nodes from the other DC. We also need to fix the 
> latch handling so that even if we send the mutation to nodes in another DC, 
> we don't get a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791744#comment-17791744
 ] 

Stefan Miklosovic commented on CASSANDRA-19120:
---

[~chovatia.jayd...@gmail.com] in your point 2), how you come to the conclusion 
that blockFor (hence latch) will be set to 2? Could you elaborate point 2) in 
more detail?

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> In a two-DC cluster, when a new node is being added to DC1, a blocking read 
> repair triggered by a local_quorum read in DC1 must send the read repair 
> mutation to one extra node(1)(2). The selector for read repair may pick *ANY* 
> node that has not been contacted before(3) instead of preferring DC1 nodes. 
> If a node from DC2 is selected, the read will time out 100% of the time 
> because of the bug described below:
> When the latch(4) for blocking read repair is initialized, the shouldBlockOn 
> function only returns true for local nodes(5), and the blockFor value is 
> reduced if a local node doesn't require repair(6). blockFor therefore equals 
> the number of read repair mutations sent out. But when the coordinator 
> receives responses from the target nodes, the latch is only counted down for 
> nodes in the same DC(7), so the latch waits until the read request times out.
> This can be reproduced with a constant load on a 3 + 3 cluster while adding a 
> node, given some way to trigger blocking read repair (for example by adding 
> load with the stress tool). With local_quorum consistency and a constant 
> read-after-write load in the DC where the node is being added, read timeouts 
> show up from time to time because of the bug described above.
>  
> I think that when read repair selects the extra node to repair, it should 
> prefer local nodes over nodes from the other DC. We also need to fix the 
> latch handling so that even if we send the mutation to nodes in another DC, 
> we don't get a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19104) Standardize tablestats formatting and data units

2023-11-30 Thread Brad Schoening (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791743#comment-17791743
 ] 

Brad Schoening commented on CASSANDRA-19104:


[~zaaath] great, we should email the mailing list to confirm agreement on this 
format change. But it seems like an obviously good idea.

> Standardize tablestats formatting and data units
> 
>
> Key: CASSANDRA-19104
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19104
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tool/nodetool
>Reporter: Brad Schoening
>Assignee: Leo Toff
>Priority: Normal
>
> Tablestats reports output in plaintext, JSON or YAML. The human-readable 
> output currently has a mix of KiB and plain bytes with inconsistent spacing.
> Considering simplifying and defaulting the output to 'human readable'. 
> Machine-readable output is available as an option, and the current mixed 
> output formatting is friendly for neither human nor machine reading.
> !image-2023-11-27-13-49-14-247.png!
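
To illustrate the kind of standardization being discussed, here is a hedged sketch (illustrative only, not the nodetool implementation) of a single helper that renders every byte count with the same binary-unit family, precision, and spacing:

{code:java}
// Illustrative sketch only: one formatter that renders every byte count with the
// same unit family (B/KiB/MiB/GiB/TiB), one decimal place, and a space before the unit.
public final class ByteUnitsSketch
{
    private static final String[] UNITS = { "B", "KiB", "MiB", "GiB", "TiB" };

    static String humanReadable(long bytes)
    {
        double value = bytes;
        int unit = 0;
        while (value >= 1024 && unit < UNITS.length - 1)
        {
            value /= 1024;
            unit++;
        }
        return String.format("%.1f %s", value, UNITS[unit]);
    }

    public static void main(String[] args)
    {
        System.out.println(humanReadable(532));        // 532.0 B
        System.out.println(humanReadable(48_553_472)); // 46.3 MiB
    }
}
{code}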



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19058) Test Failure: org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest-_jdk11

2023-11-30 Thread Ekaterina Dimitrova (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791741#comment-17791741
 ] 

Ekaterina Dimitrova commented on CASSANDRA-19058:
-

I made a mistake closing this one - it seems the error in CASSANDRA-18944 was 
different, and it was fixed by CASSANDRA-18952 before the TCM patch was committed.

Now, these runs show that the test started being flaky again with the TCM patch:
Pre-tcm clean run: 
https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2592/workflows/94342193-6c3c-4a0a-b596-8c780d8b4dfa/jobs/49963
Post-tcm run: 
https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2593/workflows/4895cc92-7d22-478f-b792-b7179a74ad1d/jobs/50059/tests

Though it seems that currently the error we see on latest trunk is different - 
simulationTest-_jdk11


{code:java}
FLAKY
org.apache.cassandra.simulator.test.ShortPaxosSimulationTest

org.apache.cassandra.simulator.SimulationException: Failed on seed 
0xc586939259d968e7
Suppressed: java.lang.RuntimeException: 
java.util.concurrent.TimeoutException
at 
org.apache.cassandra.utils.Throwables.maybeFail(Throwables.java:79)
at 
org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:537)
at 
org.apache.cassandra.distributed.impl.AbstractCluster.close(AbstractCluster.java:1098)
at 
org.apache.cassandra.simulator.ClusterSimulation.close(ClusterSimulation.java:854)
at 
org.apache.cassandra.simulator.SimulationRunner$Run.run(SimulationRunner.java:361)
at 
org.apache.cassandra.simulator.SimulationRunner$BasicCommand.run(SimulationRunner.java:346)
at 
org.apache.cassandra.simulator.paxos.PaxosSimulationRunner$Run.run(PaxosSimulationRunner.java:34)
at 
org.apache.cassandra.simulator.paxos.PaxosSimulationRunner.main(PaxosSimulationRunner.java:148)
at 
org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest(ShortPaxosSimulationTest.java:101)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Caused by: java.util.concurrent.TimeoutException
at 
org.apache.cassandra.utils.concurrent.AbstractFuture.get(AbstractFuture.java:253)
at 
org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:529)
Suppressed: java.util.concurrent.TimeoutException
Suppressed: java.util.concurrent.TimeoutException
Suppressed: java.util.concurrent.TimeoutException
Suppressed: java.util.concurrent.TimeoutException
Suppressed: java.util.concurrent.TimeoutException
Caused by: java.lang.NullPointerException
at 
org.apache.cassandra.simulator.cluster.KeyspaceActions.scheduleAndUpdateTopologyOnCompletion(KeyspaceActions.java:352)
at 
org.apache.cassandra.simulator.cluster.KeyspaceActions.next(KeyspaceActions.java:291)
at org.apache.cassandra.simulator.Actions.next(Actions.java:147)
at 
org.apache.cassandra.simulator.Actions.lambda$streamNextSupplier$3(Actions.java:156)
at 
org.apache.cassandra.simulator.Actions$LambdaAction.performSimple(Actions.java:63)
at 
org.apache.cassandra.simulator.Action.performAndRegister(Action.java:468)
at org.apache.cassandra.simulator.Action.perform(Action.java:486)
at 
org.apache.cassandra.simulator.ActionSchedule.next(ActionSchedule.java:379)
at 
org.apache.cassandra.simulator.paxos.PaxosSimulation$2.next(PaxosSimulation.java:217)
at 
org.apache.cassandra.simulator.paxos.PaxosSimulation.run(PaxosSimulation.java:189)
at 
org.apache.cassandra.simulator.paxos.PairOfSequencesPaxosSimulation.run(PairOfSequencesPaxosSimulation.java:351)
at 
org.apache.cassandra.simulator.SimulationRunner$Run.run(SimulationRunner.java:365)
at 
org.apache.cassandra.simulator.SimulationRunner$BasicCommand.run(SimulationRunner.java:346)
at 
org.apache.cassandra.simulator.paxos.PaxosSimulationRunner$Run.run(PaxosSimulationRunner.java:34)
at 
org.apache.cassandra.simulator.paxos.PaxosSimulationRunner.main(PaxosSimulationRunner.java:148)
at 
org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest(ShortPaxosSimulationTest.java:101)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

[jira] [Updated] (CASSANDRA-19011) Primary key -> row ID lookups are broken for skipping and intersections during SAI queries

2023-11-30 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-19011:

Reviewers: Caleb Rackliffe, Michael Semb Wever  (was: Caleb Rackliffe, 
Michael Thornhill)

> Primary key -> row ID lookups are broken for skipping and intersections 
> during SAI queries
> --
>
> Key: CASSANDRA-19011
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19011
> Project: Cassandra
>  Issue Type: Bug
>  Components: Feature/SAI
>Reporter: Alex Petrov
>Assignee: Mike Adamson
>Priority: Urgent
> Fix For: 5.0-beta
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> Schema:
> {code:java}
> CREATE TABLE IF NOT EXISTS distributed_test_keyspace.tbl1 (pk1 bigint,ck1 
> bigint,v1 ascii,v2 bigint, PRIMARY KEY (pk1, ck1)) WITH  CLUSTERING ORDER BY 
> (ck1 ASC);
> CREATE CUSTOM INDEX v1_sai_idx ON distributed_test_keyspace.tbl1 (v1) USING 
> 'StorageAttachedIndex' WITH OPTIONS = {'case_sensitive': 'false', 
> 'normalize': 'true', 'ascii': 'true'};
> CREATE CUSTOM INDEX v2_sai_idx ON distributed_test_keyspace.tbl1 (v2) USING 
> 'StorageAttachedIndex';
>  {code}
> {code:java}
> java.lang.AssertionError: skipped to an item smaller than the target; 
> iterator: 
> org.apache.cassandra.index.sai.disk.IndexSearchResultIterator@f399f79, target 
> key: PrimaryKey: { token: 8384965201802291970, partition: 
> DecoratedKey(8384965201802291970, c4bc1c50f9e76a50), clustering: 
> CLUSTERING:8b4b4c5991a4ea10 } , returned key: PrimaryKey: { token: 
> 8384965201802291970, partition: DecoratedKey(8384965201802291970, 
> c4bc1c50f9e76a50), clustering: CLUSTERING:89f1cf92658cb668 } 
>   at 
> org.apache.cassandra.index.sai.iterators.KeyRangeIntersectionIterator.computeNext(KeyRangeIntersectionIterator.java:95)
>   at 
> org.apache.cassandra.index.sai.iterators.KeyRangeIntersectionIterator.computeNext(KeyRangeIntersectionIterator.java:39)
>   at 
> org.apache.cassandra.utils.AbstractGuavaIterator.tryToComputeNext(AbstractGuavaIterator.java:122)
>   at 
> org.apache.cassandra.index.sai.iterators.KeyRangeIterator.tryToComputeNext(KeyRangeIterator.java:129)
>   at 
> org.apache.cassandra.utils.AbstractGuavaIterator.hasNext(AbstractGuavaIterator.java:116)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.nextKey(StorageAttachedIndexSearcher.java:274)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.nextKeyInRange(StorageAttachedIndexSearcher.java:203)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.nextSelectedKeyInRange(StorageAttachedIndexSearcher.java:234)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.nextRowIterator(StorageAttachedIndexSearcher.java:188)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.computeNext(StorageAttachedIndexSearcher.java:169)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.computeNext(StorageAttachedIndexSearcher.java:111)
>   at 
> org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
>   at 
> org.apache.cassandra.db.transform.BasePartitions.hasNext(BasePartitions.java:91)
>   at 
> org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:338)
>   at 
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:201)
>   at 
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:186)
>   at 
> org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:48)
>   at 
> org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:346)
>   at 
> org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:2186)
>   at 
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2581)
>   at 
> org.apache.cassandra.concurrent.ExecutionFailure$2.run(ExecutionFailure.java:163)
>   at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:143)
>   at 
> relocated.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.base/java.lang.Thread.run(Thread.java:829) {code}
>  
> Unfortunately, there's no tooling for shrinking around SAI just yet, but I 
> have a programmatic repro using INSERT and DELETE statements. I will do my 
> best to post it asap, but thought this can already be useful for visibility.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---

[jira] [Updated] (CASSANDRA-19011) Primary key -> row ID lookups are broken for skipping and intersections during SAI queries

2023-11-30 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-19011:

Reviewers: Caleb Rackliffe, Michael Thornhill  (was: Caleb Rackliffe)

> Primary key -> row ID lookups are broken for skipping and intersections 
> during SAI queries
> --
>
> Key: CASSANDRA-19011
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19011
> Project: Cassandra
>  Issue Type: Bug
>  Components: Feature/SAI
>Reporter: Alex Petrov
>Assignee: Mike Adamson
>Priority: Urgent
> Fix For: 5.0-beta
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> Schema:
> {code:java}
> CREATE TABLE IF NOT EXISTS distributed_test_keyspace.tbl1 (pk1 bigint,ck1 
> bigint,v1 ascii,v2 bigint, PRIMARY KEY (pk1, ck1)) WITH  CLUSTERING ORDER BY 
> (ck1 ASC);
> CREATE CUSTOM INDEX v1_sai_idx ON distributed_test_keyspace.tbl1 (v1) USING 
> 'StorageAttachedIndex' WITH OPTIONS = {'case_sensitive': 'false', 
> 'normalize': 'true', 'ascii': 'true'};
> CREATE CUSTOM INDEX v2_sai_idx ON distributed_test_keyspace.tbl1 (v2) USING 
> 'StorageAttachedIndex';
>  {code}
> {code:java}
> java.lang.AssertionError: skipped to an item smaller than the target; 
> iterator: 
> org.apache.cassandra.index.sai.disk.IndexSearchResultIterator@f399f79, target 
> key: PrimaryKey: { token: 8384965201802291970, partition: 
> DecoratedKey(8384965201802291970, c4bc1c50f9e76a50), clustering: 
> CLUSTERING:8b4b4c5991a4ea10 } , returned key: PrimaryKey: { token: 
> 8384965201802291970, partition: DecoratedKey(8384965201802291970, 
> c4bc1c50f9e76a50), clustering: CLUSTERING:89f1cf92658cb668 } 
>   at 
> org.apache.cassandra.index.sai.iterators.KeyRangeIntersectionIterator.computeNext(KeyRangeIntersectionIterator.java:95)
>   at 
> org.apache.cassandra.index.sai.iterators.KeyRangeIntersectionIterator.computeNext(KeyRangeIntersectionIterator.java:39)
>   at 
> org.apache.cassandra.utils.AbstractGuavaIterator.tryToComputeNext(AbstractGuavaIterator.java:122)
>   at 
> org.apache.cassandra.index.sai.iterators.KeyRangeIterator.tryToComputeNext(KeyRangeIterator.java:129)
>   at 
> org.apache.cassandra.utils.AbstractGuavaIterator.hasNext(AbstractGuavaIterator.java:116)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.nextKey(StorageAttachedIndexSearcher.java:274)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.nextKeyInRange(StorageAttachedIndexSearcher.java:203)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.nextSelectedKeyInRange(StorageAttachedIndexSearcher.java:234)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.nextRowIterator(StorageAttachedIndexSearcher.java:188)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.computeNext(StorageAttachedIndexSearcher.java:169)
>   at 
> org.apache.cassandra.index.sai.plan.StorageAttachedIndexSearcher$ResultRetriever.computeNext(StorageAttachedIndexSearcher.java:111)
>   at 
> org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
>   at 
> org.apache.cassandra.db.transform.BasePartitions.hasNext(BasePartitions.java:91)
>   at 
> org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:338)
>   at 
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:201)
>   at 
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:186)
>   at 
> org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:48)
>   at 
> org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:346)
>   at 
> org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:2186)
>   at 
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2581)
>   at 
> org.apache.cassandra.concurrent.ExecutionFailure$2.run(ExecutionFailure.java:163)
>   at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:143)
>   at 
> relocated.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.base/java.lang.Thread.run(Thread.java:829) {code}
>  
> Unfortunately, there's no tooling for shrinking around SAI just yet, but I 
> have a programmatic repro using INSERT and DELETE statements. I will do my 
> best to post it asap, but thought this can already be useful for visibility.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


(cassandra-website) branch asf-staging updated (901e14d65 -> a97119f7f)

2023-11-30 Thread git-site-role
This is an automated email from the ASF dual-hosted git repository.

git-site-role pushed a change to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/cassandra-website.git


 discard 901e14d65 generate docs for 04cb904e
 new a97119f7f generate docs for 04cb904e

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (901e14d65)
\
 N -- N -- N   refs/heads/asf-staging (a97119f7f)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 content/search-index.js |   2 +-
 site-ui/build/ui-bundle.zip | Bin 4881597 -> 4881597 bytes
 2 files changed, 1 insertion(+), 1 deletion(-)


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19127) Test Failure: org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest

2023-11-30 Thread Ekaterina Dimitrova (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791735#comment-17791735
 ] 

Ekaterina Dimitrova edited comment on CASSANDRA-19127 at 11/30/23 6:42 PM:
---

Failure caused by the TCM patch. Adding to the respective epic that targets 
those fixes.
All green at the commit before TCM - 
https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2592/workflows/94342193-6c3c-4a0a-b596-8c780d8b4dfa/jobs/49963
Flaky after the TCM commit - 
https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2593/workflows/4895cc92-7d22-478f-b792-b7179a74ad1d/jobs/50059/tests

EDIT: my bad, it is a different failure that was fixed... continue bisecting


was (Author: e.dimitrova):
Failure caused by the TCM patch. Adding to the respective epic that targets 
those fixes.
All green at the commit before TCM - 
https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2592/workflows/94342193-6c3c-4a0a-b596-8c780d8b4dfa/jobs/49963
Flaky after the TCM commit - 
https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2593/workflows/4895cc92-7d22-478f-b792-b7179a74ad1d/jobs/50059/tests

> Test Failure: 
> org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest
> -
>
> Key: CASSANDRA-19127
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19127
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 5.1-beta
>
>
> The test is not flaky on 5.0 - 
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2590/workflows/47dedf52-87fd-4178-bc89-d179e58b6562
> But it is significantly flaky on trunk - 
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2591/workflows/1150aab2-4961-4fe3-a126-b96356fdb939/jobs/49867/tests
> {code:java}
> org.apache.cassandra.simulator.SimulationException: Failed on seed 
> 0xf2b8eff98afd45dd
>   Suppressed: java.lang.RuntimeException: 
> java.util.concurrent.TimeoutException
>   at 
> org.apache.cassandra.utils.Throwables.maybeFail(Throwables.java:79)
>   at 
> org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:537)
>   at 
> org.apache.cassandra.distributed.impl.AbstractCluster.close(AbstractCluster.java:1098)
>   at 
> org.apache.cassandra.simulator.ClusterSimulation.close(ClusterSimulation.java:854)
>   at 
> org.apache.cassandra.simulator.SimulationRunner$Run.run(SimulationRunner.java:361)
>   at 
> org.apache.cassandra.simulator.SimulationRunner$BasicCommand.run(SimulationRunner.java:346)
>   at 
> org.apache.cassandra.simulator.paxos.PaxosSimulationRunner$Run.run(PaxosSimulationRunner.java:34)
>   at 
> org.apache.cassandra.simulator.paxos.PaxosSimulationRunner.main(PaxosSimulationRunner.java:148)
>   at 
> org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest(ShortPaxosSimulationTest.java:101)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   Caused by: java.util.concurrent.TimeoutException
>   at 
> org.apache.cassandra.utils.concurrent.AbstractFuture.get(AbstractFuture.java:253)
>   at 
> org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:529)
>   Suppressed: java.util.concurrent.TimeoutException
>   Suppressed: java.util.concurrent.TimeoutException
>   Suppressed: java.util.concurrent.TimeoutException
>   Suppressed: java.util.concurrent.TimeoutException
>   Suppressed: java.util.concurrent.TimeoutException
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.cassandra.simulator.cluster.KeyspaceActions.scheduleAndUpdateTopologyOnCompletion(KeyspaceActions.java:352)
>   at 
> org.apache.cassandra.simulator.cluster.KeyspaceActions.next(KeyspaceActions.java:291)
>   at org.apache.cassandra.simulator.Actions.next(Actions.java:147)
>   at 
> org.apache.cassandra.simulator.Actions.lambda$streamNextSupplier$3(Actions.java:156)
>   at 
> org.apache.cassandra.simulator.Actions$LambdaAction.performSimple(Actions.java:63)
>   at 
> org.apache.cassandra.simulator.Action.performAndRegister(Action.java:468)
>   at org.apache.cassandra.simulator.Action.perform(Action.java:486)
>   at 
> org

[jira] [Commented] (CASSANDRA-19127) Test Failure: org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest

2023-11-30 Thread Ekaterina Dimitrova (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791735#comment-17791735
 ] 

Ekaterina Dimitrova commented on CASSANDRA-19127:
-

Failure caused by the TCM patch. Adding to the respective epic that targets 
those fixes.
All green at the commit before TCM - 
https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2592/workflows/94342193-6c3c-4a0a-b596-8c780d8b4dfa/jobs/49963
Flaky after the TCM commit - 
https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2593/workflows/4895cc92-7d22-478f-b792-b7179a74ad1d/jobs/50059/tests

> Test Failure: 
> org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest
> -
>
> Key: CASSANDRA-19127
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19127
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Priority: Normal
> Fix For: 5.1-beta
>
>
> The test is not flaky on 5.0 - 
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2590/workflows/47dedf52-87fd-4178-bc89-d179e58b6562
> But it is significantly flaky on trunk - 
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2591/workflows/1150aab2-4961-4fe3-a126-b96356fdb939/jobs/49867/tests
> {code:java}
> org.apache.cassandra.simulator.SimulationException: Failed on seed 
> 0xf2b8eff98afd45dd
>   Suppressed: java.lang.RuntimeException: 
> java.util.concurrent.TimeoutException
>   at 
> org.apache.cassandra.utils.Throwables.maybeFail(Throwables.java:79)
>   at 
> org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:537)
>   at 
> org.apache.cassandra.distributed.impl.AbstractCluster.close(AbstractCluster.java:1098)
>   at 
> org.apache.cassandra.simulator.ClusterSimulation.close(ClusterSimulation.java:854)
>   at 
> org.apache.cassandra.simulator.SimulationRunner$Run.run(SimulationRunner.java:361)
>   at 
> org.apache.cassandra.simulator.SimulationRunner$BasicCommand.run(SimulationRunner.java:346)
>   at 
> org.apache.cassandra.simulator.paxos.PaxosSimulationRunner$Run.run(PaxosSimulationRunner.java:34)
>   at 
> org.apache.cassandra.simulator.paxos.PaxosSimulationRunner.main(PaxosSimulationRunner.java:148)
>   at 
> org.apache.cassandra.simulator.test.ShortPaxosSimulationTest.simulationTest(ShortPaxosSimulationTest.java:101)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   Caused by: java.util.concurrent.TimeoutException
>   at 
> org.apache.cassandra.utils.concurrent.AbstractFuture.get(AbstractFuture.java:253)
>   at 
> org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:529)
>   Suppressed: java.util.concurrent.TimeoutException
>   Suppressed: java.util.concurrent.TimeoutException
>   Suppressed: java.util.concurrent.TimeoutException
>   Suppressed: java.util.concurrent.TimeoutException
>   Suppressed: java.util.concurrent.TimeoutException
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.cassandra.simulator.cluster.KeyspaceActions.scheduleAndUpdateTopologyOnCompletion(KeyspaceActions.java:352)
>   at 
> org.apache.cassandra.simulator.cluster.KeyspaceActions.next(KeyspaceActions.java:291)
>   at org.apache.cassandra.simulator.Actions.next(Actions.java:147)
>   at 
> org.apache.cassandra.simulator.Actions.lambda$streamNextSupplier$3(Actions.java:156)
>   at 
> org.apache.cassandra.simulator.Actions$LambdaAction.performSimple(Actions.java:63)
>   at 
> org.apache.cassandra.simulator.Action.performAndRegister(Action.java:468)
>   at org.apache.cassandra.simulator.Action.perform(Action.java:486)
>   at 
> org.apache.cassandra.simulator.ActionSchedule.next(ActionSchedule.java:379)
>   at 
> org.apache.cassandra.simulator.paxos.PaxosSimulation$2.next(PaxosSimulation.java:217)
>   at 
> org.apache.cassandra.simulator.paxos.PaxosSimulation.run(PaxosSimulation.java:189)
>   at 
> org.apache.cassandra.simulator.paxos.PairOfSequencesPaxosSimulation.run(PairOfSequencesPaxosSimulation.java:351)
>   at 
> org.apache.cassandra.simulator.SimulationRunner$Run.run(SimulationRunner.java:365)
>   at 
> org.apache.cassandra.simulator.SimulationRunner$BasicCommand.run(SimulationRunner.

(cassandra) branch cep-15-accord updated (8645cf49af -> 166bdee9bf)

2023-11-30 Thread bdeggleston
This is an automated email from the ASF dual-hosted git repository.

bdeggleston pushed a change to branch cep-15-accord
in repository https://gitbox.apache.org/repos/asf/cassandra.git


from 8645cf49af Ninja for CASSANDRA-19045: use the latest sha from trunk 
rather than an old one from 10 months ago
 add 166bdee9bf Reduce command deps

No new revisions were added by this update.

Summary of changes:
 modules/accord |   2 +-
 .../db/compaction/CompactionIterator.java  |  93 +++-
 .../service/accord/AccordCachingState.java |  54 +-
 .../service/accord/AccordCommandStore.java | 147 +-
 .../service/accord/AccordCommandsForKeys.java  | 252 ++
 .../cassandra/service/accord/AccordKeyspace.java   | 553 -
 .../service/accord/AccordMessageSink.java  |   2 +-
 .../service/accord/AccordObjectSizes.java  |  56 ++-
 .../service/accord/AccordSafeCommandStore.java | 180 +--
 .../service/accord/AccordSafeCommandsForKey.java   |  24 +-
 ...nd.java => AccordSafeCommandsForKeyUpdate.java} |  95 ++--
 ...ForKey.java => AccordSafeTimestampsForKey.java} |  59 ++-
 .../cassandra/service/accord/AccordStateCache.java | 157 --
 .../service/accord/CommandsForKeyUpdate.java   | 101 
 .../service/accord/CommandsForRanges.java  |  35 +-
 .../service/accord/async/AsyncLoader.java  |  74 ++-
 .../service/accord/async/AsyncOperation.java   |  23 +-
 .../accord/serializers/ApplySerializers.java   |  27 +-
 .../accord/serializers/CheckStatusSerializers.java |   1 +
 .../accord/serializers/CommitSerializers.java  |  27 +-
 .../accord/serializers/FetchSerializers.java   |  25 +-
 .../cassandra/service/accord/txn/TxnWrite.java |  13 +-
 .../compaction/CompactionAccordIteratorsTest.java  |  46 +-
 .../service/accord/AccordCachingStateTest.java |   7 +-
 .../service/accord/AccordCommandStoreTest.java | 194 +++-
 .../service/accord/AccordCommandTest.java  |  27 +-
 .../service/accord/AccordStateCacheTest.java   |  34 +-
 .../cassandra/service/accord/AccordTestUtils.java  |   9 +-
 .../service/accord/async/AsyncLoaderTest.java  | 185 +--
 .../service/accord/async/AsyncOperationTest.java   |  17 +-
 30 files changed, 1850 insertions(+), 669 deletions(-)
 create mode 100644 
src/java/org/apache/cassandra/service/accord/AccordCommandsForKeys.java
 copy src/java/org/apache/cassandra/service/accord/{AccordSafeCommand.java => 
AccordSafeCommandsForKeyUpdate.java} (50%)
 copy 
src/java/org/apache/cassandra/service/accord/{AccordSafeCommandsForKey.java => 
AccordSafeTimestampsForKey.java} (71%)
 create mode 100644 
src/java/org/apache/cassandra/service/accord/CommandsForKeyUpdate.java


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19056) Test failure: materialized_views_test.TestMaterializedViewsConsistency.test_multi_partition_consistent_reads_after_write

2023-11-30 Thread Sam Tunnicliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Tunnicliffe updated CASSANDRA-19056:

Reviewers: Sam Tunnicliffe  (was: Sam Tunnicliffe)
   Status: Review In Progress  (was: Patch Available)

> Test failure: 
> materialized_views_test.TestMaterializedViewsConsistency.test_multi_partition_consistent_reads_after_write
> 
>
> Key: CASSANDRA-19056
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19056
> Project: Cassandra
>  Issue Type: Bug
>  Components: Feature/Materialized Views, Test/dtest/python
>Reporter: Sam Tunnicliffe
>Assignee: Marcus Eriksson
>Priority: Normal
> Fix For: 5.1-alpha1
>
>
> Fails or is flaky on both JDK 11 and 17 
> [https://app.circleci.com/pipelines/github/michaelsembwever/cassandra/256/workflows/c4fda8f1-a8d6-4523-be83-5e30b9de39fe/jobs/20462/parallel-runs/14]
>  
> {noformat}
> [node3] 'ERROR [MutationStage-1] 2023-11-23 21:18:31,953 
> JVMStabilityInspector.java:70 - Exception in thread 
> Thread[MutationStage-1,10,SharedPool]
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.cassandra.schema.TableMetadata.partitionKeyColumns()" because 
> "this.viewMetadata" is null
> at 
> org.apache.cassandra.db.view.ViewUpdateGenerator.<init>(ViewUpdateGenerator.java:99)
> at 
> org.apache.cassandra.db.view.TableViews.generateViewUpdates(TableViews.java:227)
> at 
> org.apache.cassandra.db.view.TableViews.pushViewReplicaUpdates(TableViews.java:193)
> at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:615)
> at org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:447)
> at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:239)
> at 
> org.apache.cassandra.db.MutationVerbHandler.applyMutation(MutationVerbHandler.java:64)
> at 
> org.apache.cassandra.db.AbstractMutationVerbHandler.processMessage(AbstractMutationVerbHandler.java:60)
> at 
> org.apache.cassandra.db.MutationVerbHandler.doVerb(MutationVerbHandler.java:54)
> at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:102)
> at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:122)
> at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:51)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:432)
> at 
> org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
> at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:143)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:833)', [node3] 'ERROR 
> [MutationStage-2] 2023-11-23 21:18:31,953 JVMStabilityInspector.java:70 - 
> Exception in thread Thread[MutationStage-2,5,SharedPool]
> java.lang.NullPointerException: Cannot invoke 
> "org.apache.cassandra.schema.TableMetadata.partitionKeyColumns()" because 
> "this.viewMetadata" is null
> at 
> org.apache.cassandra.db.view.ViewUpdateGenerator.<init>(ViewUpdateGenerator.java:99)
> at 
> org.apache.cassandra.db.view.TableViews.generateViewUpdates(TableViews.java:227)
> at 
> org.apache.cassandra.db.view.TableViews.pushViewReplicaUpdates(TableViews.java:193)
> at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:615)
> at org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:447)
> at org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:239)
> at 
> org.apache.cassandra.db.MutationVerbHandler.applyMutation(MutationVerbHandler.java:64)
> at 
> org.apache.cassandra.db.AbstractMutationVerbHandler.processMessage(AbstractMutationVerbHandler.java:60)
> at 
> org.apache.cassandra.db.MutationVerbHandler.doVerb(MutationVerbHandler.java:54)
> at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:102)
> at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:122)
> at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:51)
> at 
> org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:432)
> at 
> org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
> at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:143)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:833)']
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Updated] (CASSANDRASC-82) Expose additional SSL configuration options for the Sidecar Service

2023-11-30 Thread Yifan Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRASC-82?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Cai updated CASSANDRASC-82:
-
  Fix Version/s: 1.0
Source Control Link: 
https://github.com/apache/cassandra-sidecar/commit/ad936f6482aee2a05fa45ba4fdd06267958298f6
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

> Expose additional SSL configuration options for the Sidecar Service
> ---
>
> Key: CASSANDRASC-82
> URL: https://issues.apache.org/jira/browse/CASSANDRASC-82
> Project: Sidecar for Apache Cassandra
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: Francisco Guerrero
>Assignee: Francisco Guerrero
>Priority: Normal
>  Labels: low-hanging-fruit, lowhanging-fruit, 
> pull-request-available
> Fix For: 1.0
>
>
> Sidecar exposes some SSL configuration options, but there are additional 
> options that Sidecar should be exposing for users. Similar to what Cassandra 
> offers in terms of configurations, we should be able to configure 
> {{cipher_suites}} as well as {{accepted_protocols}} under SSL configuration.
> Additionally, we should explore if there are any other SSL knobs that 
> Cassandra exposes that Sidecar doesn't and add it as part of this jira.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19120) local consistencies may get timeout if blocking read repair is sending the read repair mutation to other DC

2023-11-30 Thread Runtian Liu (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791731#comment-17791731
 ] 

Runtian Liu commented on CASSANDRA-19120:
-

[~smiklosovic] I think trunk (5.1) and 5.0 are basically the same here. Although 
in 5.0 the int blockFor() function returns "writePlan.writeQuorum();", it looks 
like that function is never called; the latch is still initialized with the local 
variable blockFor in the constructor, which matches the trunk version's 
adjustedBlockFor. The problem for blocking read repair is that the cross-DC node 
has not been contacted before (no read was sent to it). It is the 
[selector|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L411]
 that requires one extra node to respond when applying the read repair mutation 
[[1]|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L572].

This means that with replication factor 3 in each DC, a read at local_quorum 
(assuming no speculative retry is triggered) that hits a digest mismatch starts a 
blocking read repair. Normally, blocking read repair only tries to repair the 
replicas that were contacted (read) before, while also satisfying the consistency 
level's blockFor. If blockFor is larger than the number of replicas contacted 
before, more nodes are added to apply the read-repair mutation; in the normal 
case, blockFor equals the number of nodes contacted before. However, when a node 
is being added in the same DC, blockFor becomes blockFor + pending, which requires 
one more node to be repaired, and this "one more node" can be any node that has 
not been contacted before. As mentioned, the latch is initialized with the number 
of nodes the read-repair mutations were sent to, but it is only counted down on 
responses from the same DC, so the coordinator times out 100% of the time when a 
cross-DC node is selected to perform the read-repair mutation.
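
To make the failure mode easier to see, here is a minimal, hypothetical sketch of the mismatch described above (plain Java with illustrative names, not the actual BlockingPartitionRepair code): the latch is sized by the number of repair mutations sent, but only acknowledgements from the local DC count it down, so a remote target guarantees a timeout.

{code:java}
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; names do not mirror the real implementation.
final class BlockingRepairSketch
{
    record Target(String dc, boolean acked) {}

    static boolean waitForRepairs(List<Target> targets, String localDc, long timeoutMs)
            throws InterruptedException
    {
        // Latch sized by every repair mutation sent, local or remote.
        CountDownLatch latch = new CountDownLatch(targets.size());

        for (Target t : targets)
        {
            // Acknowledgements arrive from every target, but only local-DC ones are
            // counted down, mirroring the behaviour described in this ticket.
            if (t.acked() && t.dc().equals(localDc))
                latch.countDown();
        }

        return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException
    {
        // Two local targets plus the "extra" pending-range target picked from DC2:
        List<Target> targets = List.of(new Target("DC1", true),
                                       new Target("DC1", true),
                                       new Target("DC2", true));
        // Prints false: the DC2 acknowledgement never counts the latch down, so the wait times out.
        System.out.println(waitForRepairs(targets, "DC1", 100));
    }
}
{code}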

> local consistencies may get timeout if blocking read repair is sending the 
> read repair mutation to other DC 
> 
>
> Key: CASSANDRA-19120
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19120
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Runtian Liu
>Priority: Normal
> Attachments: image-2023-11-29-15-26-08-056.png, signature.asc
>
>
> In a two-DC cluster, when a new node is being added to DC1, a blocking read 
> repair triggered by a local_quorum read in DC1 must send the read repair 
> mutation to one extra node(1)(2). The selector for read repair may pick *ANY* 
> node that has not been contacted before(3) instead of preferring DC1 nodes. 
> If a node from DC2 is selected, the read will time out 100% of the time 
> because of the bug described below:
> When the latch(4) for blocking read repair is initialized, the shouldBlockOn 
> function only returns true for local nodes(5), and the blockFor value is 
> reduced if a local node doesn't require repair(6). blockFor therefore equals 
> the number of read repair mutations sent out. But when the coordinator 
> receives responses from the target nodes, the latch is only counted down for 
> nodes in the same DC(7), so the latch waits until the read request times out.
> This can be reproduced with a constant load on a 3 + 3 cluster while adding a 
> node, given some way to trigger blocking read repair (for example by adding 
> load with the stress tool). With local_quorum consistency and a constant 
> read-after-write load in the DC where the node is being added, read timeouts 
> show up from time to time because of the bug described above.
>  
> I think that when read repair selects the extra node to repair, it should 
> prefer local nodes over nodes from the other DC. We also need to fix the 
> latch handling so that even if we send the mutation to nodes in another DC, 
> we don't get a timeout.
> (1)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L455]
> (2)[https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ConsistencyLevel.java#L183]
> (3)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/locator/ReplicaPlans.java#L458]
> (4)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L96]
> (5)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L71]
> (6)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L88]
> (7)[https://github.com/apache/cassandra/blob/cassandra-4.0.11/src/java/org/apache/cassandra/service/reads/repair/BlockingPartitionRepair.java#L113]

[jira] [Commented] (CASSANDRASC-82) Expose additional SSL configuration options for the Sidecar Service

2023-11-30 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRASC-82?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791732#comment-17791732
 ] 

ASF subversion and git services commented on CASSANDRASC-82:


Commit ad936f6482aee2a05fa45ba4fdd06267958298f6 in cassandra-sidecar's branch 
refs/heads/trunk from Francisco Guerrero
[ https://gitbox.apache.org/repos/asf?p=cassandra-sidecar.git;h=ad936f6 ]

CASSANDRASC-82: Expose additional SSL configuration options for the Sidecar 
Service

Patch by Francisco Guerrero; Reviewed by Doug Rohrer, Yifan Cai for 
CASSANDRASC-82


> Expose additional SSL configuration options for the Sidecar Service
> ---
>
> Key: CASSANDRASC-82
> URL: https://issues.apache.org/jira/browse/CASSANDRASC-82
> Project: Sidecar for Apache Cassandra
>  Issue Type: Improvement
>  Components: Configuration
>Reporter: Francisco Guerrero
>Assignee: Francisco Guerrero
>Priority: Normal
>  Labels: low-hanging-fruit, lowhanging-fruit, 
> pull-request-available
>
> Sidecar exposes some SSL configuration options, but there are additional 
> options that Sidecar should be exposing for users. Similar to what Cassandra 
> offers in terms of configurations, we should be able to configure 
> {{cipher_suites}} as well as {{accepted_protocols}} under SSL configuration.
> Additionally, we should explore if there are any other SSL knobs that 
> Cassandra exposes that Sidecar doesn't and add it as part of this jira.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



(cassandra-sidecar) branch trunk updated: CASSANDRASC-82: Expose additional SSL configuration options for the Sidecar Service

2023-11-30 Thread ycai
This is an automated email from the ASF dual-hosted git repository.

ycai pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra-sidecar.git


The following commit(s) were added to refs/heads/trunk by this push:
 new ad936f6  CASSANDRASC-82: Expose additional SSL configuration options 
for the Sidecar Service
ad936f6 is described below

commit ad936f6482aee2a05fa45ba4fdd06267958298f6
Author: Francisco Guerrero 
AuthorDate: Wed Nov 15 17:09:49 2023 -0800

CASSANDRASC-82: Expose additional SSL configuration options for the Sidecar 
Service

Patch by Francisco Guerrero; Reviewed by Doug Rohrer, Yifan Cai for 
CASSANDRASC-82
---
 CHANGES.txt|  1 +
 src/main/dist/conf/sidecar.yaml|  5 ++
 .../cassandra/sidecar/config/SslConfiguration.java | 18 +
 .../sidecar/config/yaml/SslConfigurationImpl.java  | 90 +-
 .../sidecar/server/HttpServerOptionsProvider.java  | 10 ++-
 .../cassandra/sidecar/IntegrationTestBase.java |  2 +-
 .../sidecar/config/SidecarConfigurationTest.java   | 32 +---
 .../config/yaml/SslConfigurationImplTest.java  | 45 +++
 .../cassandra/sidecar/server/ServerSSLTest.java| 42 ++
 .../config/sidecar_multiple_instances.yaml |  5 ++
 .../resources/config/sidecar_single_instance.yaml  |  5 ++
 src/test/resources/config/sidecar_ssl.yaml | 12 +++
 12 files changed, 236 insertions(+), 31 deletions(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index 8e38818..a457c5d 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,5 +1,6 @@
 1.0.0
 -
+ * Expose additional SSL configuration options for the Sidecar Service 
(CASSANDRASC-82)
  * Expose additional node settings (CASSANDRASC-84)
  * Sidecar does not handle keyspaces and table names with mixed case 
(CASSANDRASC-76)
  * Require gossip to be enabled for ring and token ranges mapping endpoints 
(CASSANDRASC-83)
diff --git a/src/main/dist/conf/sidecar.yaml b/src/main/dist/conf/sidecar.yaml
index 884e183..6104f69 100644
--- a/src/main/dist/conf/sidecar.yaml
+++ b/src/main/dist/conf/sidecar.yaml
@@ -113,7 +113,12 @@ sidecar:
 #use_openssl: true
 #handshake_timeout_sec: 10
 #client_auth: NONE # valid options are NONE, REQUEST, REQUIRED
+#accepted_protocols:
+# - TLSv1.2
+# - TLSv1.3
+#cipher_suites: []
 #keystore:
+#  type: PKCS12
 #  path: "path/to/keystore.p12"
 #  password: password
 #  check_interval_sec: 300
diff --git 
a/src/main/java/org/apache/cassandra/sidecar/config/SslConfiguration.java 
b/src/main/java/org/apache/cassandra/sidecar/config/SslConfiguration.java
index 768d197..2205e35 100644
--- a/src/main/java/org/apache/cassandra/sidecar/config/SslConfiguration.java
+++ b/src/main/java/org/apache/cassandra/sidecar/config/SslConfiguration.java
@@ -18,6 +18,8 @@
 
 package org.apache.cassandra.sidecar.config;
 
+import java.util.List;
+
 /**
  * Encapsulates SSL Configuration
  */
@@ -53,6 +55,22 @@ public interface SslConfiguration
  */
 String clientAuth();
 
+/**
+ * Return a list of the enabled cipher suites. The list of cipher suites 
must be provided in the
+ * desired order for its intended use.
+ *
+ * @return the enabled cipher suites
+ */
+List<String> cipherSuites();
+
+/**
+ * Returns a list of enabled SSL/TLS protocols. The list of accepted 
protocols must be provided in the
+ * desired order of use.
+ *
+ * @return the enabled SSL/TLS protocols
+ */
+List<String> secureTransportProtocols();
+
 /**
  * @return {@code true} if the keystore is configured, and the {@link 
KeyStoreConfiguration#path()} and
  * {@link KeyStoreConfiguration#password()} parameters are provided
diff --git 
a/src/main/java/org/apache/cassandra/sidecar/config/yaml/SslConfigurationImpl.java
 
b/src/main/java/org/apache/cassandra/sidecar/config/yaml/SslConfigurationImpl.java
index fbb7677..0239121 100644
--- 
a/src/main/java/org/apache/cassandra/sidecar/config/yaml/SslConfigurationImpl.java
+++ 
b/src/main/java/org/apache/cassandra/sidecar/config/yaml/SslConfigurationImpl.java
@@ -18,7 +18,10 @@
 
 package org.apache.cassandra.sidecar.config.yaml;
 
+import java.util.ArrayList;
 import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
 import java.util.stream.Collectors;
 
 import com.fasterxml.jackson.annotation.JsonProperty;
@@ -36,6 +39,8 @@ public class SslConfigurationImpl implements SslConfiguration
 public static final boolean DEFAULT_USE_OPEN_SSL = true;
 public static final long DEFAULT_HANDSHAKE_TIMEOUT_SECONDS = 10L;
 public static final String DEFAULT_CLIENT_AUTH = "NONE";
+public static final List<String> DEFAULT_SECURE_TRANSPORT_PROTOCOLS
+= Collections.unmodifiableList(Arrays.asList("TLSv1.2", "TLSv1.3"));
 
 
 @JsonProperty("enabled")
@@ -49,6 +54,12 @@ public class SslConfigurationImpl implements SslConfiguration
 
 protected St

[jira] [Commented] (CASSANDRA-18947) Test failure: dtest-novnode.disk_balance_test.TestDiskBalance.test_disk_balance_stress

2023-11-30 Thread Ekaterina Dimitrova (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791728#comment-17791728
 ] 

Ekaterina Dimitrova commented on CASSANDRA-18947:
-

bq. Oh so you'd rather use async_almost_equal directly? Or you can use it as it 
is now for consistency with the rest of the test class, if sbdy improves/fixes 
assert_balanced in the future, which is the most probable, this test method 
would not benefit from that,.. To me it's not confusing. ICWYM but the little 
extra loop vs benefits... There are arguments both sides imo. I don't have a 
strong preference here.

Well, to me it was a matter of clean code: if we are thinking of fixing it, then 
better to fix it now, or go down the increased-loop path until it is fixed. But if 
I am the only one finding this confusing - +1. The repeated runs also LGTM. Thanks 
for looking into the problem (ignoring the issue that was already mentioned as 
being solved in another ticket).

> Test failure: 
> dtest-novnode.disk_balance_test.TestDiskBalance.test_disk_balance_stress
> --
>
> Key: CASSANDRA-18947
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18947
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest/python
>Reporter: Ekaterina Dimitrova
>Assignee: Berenguer Blasi
>Priority: Normal
> Fix For: 5.0-rc
>
>
> Seen here:
> https://ci-cassandra.apache.org/job/Cassandra-5.0/72/testReport/dtest-novnode.disk_balance_test/TestDiskBalance/test_disk_balance_stress/
> h3.  
> {code:java}
> Error Message
> AssertionError: values not within 10.00% of the max: (2534183, 2762123, 
> 2423706) (node1)
> Stacktrace
> self = 
> 
>     def test_disk_balance_stress(self):
>         cluster = self.cluster
>         if self.dtest_config.use_vnodes:
>             cluster.set_configuration_options(values={'num_tokens': 256})
>         cluster.populate(4).start()
>         node1 = cluster.nodes['node1']
>         node1.stress(['write', 'n=50k', 'no-warmup', '-rate', 'threads=100',
>                       '-schema', 'replication(factor=3)',
>                       'compaction(strategy=SizeTieredCompactionStrategy,enabled=false)'])
>         cluster.flush()
>         # make sure the data directories are balanced:
>         for node in cluster.nodelist():
> >           self.assert_balanced(node)
> 
> disk_balance_test.py:48: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> disk_balance_test.py:186: in assert_balanced
>     assert_almost_equal(*new_sums, error=0.1, error_message=node.name)
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> 
> args = (2534183, 2762123, 2423706)
> kwargs = {'error': 0.1, 'error_message': 'node1'}, error = 0.1, vmax = 2762123
> vmin = 2423706, error_message = 'node1'
> 
>     def assert_almost_equal(*args, **kwargs):
>         """
>         Assert variable number of arguments all fall within a margin of error.
>         @params *args variable number of numerical arguments to check
>         @params error Optional margin of error. Default 0.16
>         @params error_message Optional error message to print. Default ''
>         Examples:
>             assert_almost_equal(sizes[2], init_size)
>             assert_almost_equal(ttl_session1, ttl_session2[0][0], error=0.005)
>         """
>         error = kwargs['error'] if 'error' in kwargs else 0.16
>         vmax = max(args)
>         vmin = min(args)
>         error_message = '' if 'error_message' not in kwargs else kwargs['error_message']
>         assert vmin > vmax * (1.0 - error) or vmin == vmax, \
> >           "values not within {:.2f}% of the max: {} ({})".format(error * 100, args, error_message)
> E       AssertionError: values not within 10.00% of the max: (2534183, 2762123, 2423706) (node1)
> 
> tools/assertions.py:206: AssertionError
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org


