[jira] [Commented] (CASSANDRA-19693) Relax slow_query_log_timeout for MultiNodeSAITest

2024-06-10 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17853898#comment-17853898
 ] 

Alex Petrov commented on CASSANDRA-19693:
-

+1, LGTM. Thank you for the patch!

> Relax slow_query_log_timeout for MultiNodeSAITest
> -
>
> Key: CASSANDRA-19693
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19693
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Feature/SAI, Test/fuzz
>Reporter: Caleb Rackliffe
>Assignee: Caleb Rackliffe
>Priority: Normal
> Fix For: 5.x
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> To stress the paging subsystem, we intentionally use a comically low fetch 
> size in {{{}MultiNodeSAITest{}}}. This can lead to some very slow queries 
> when we get matches into the hundreds of rows. It looks like CASSANDRA-19534 
> has gotten a little more aggressive about how the slow query timeout is 
> triggered, and there’s a lot of noise around this in the logs, even in local 
> runs. I think bumping the default slow query timeout and perhaps the native 
> protocol timeout a bit should clear this up.
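
For illustration, a minimal sketch of how a test could relax those two thresholds via the
in-jvm dtest cluster config. The config keys and values here are assumptions for the sake of
the example, not the names or numbers used in the actual patch:

{noformat}
// Hypothetical sketch: relax the slow-query and client timeouts for a paging-heavy test.
// Assumes the in-jvm dtest API; the "slow_query_log_timeout" / "native_transport_timeout"
// keys and values are illustrative, not the patch's actual settings.
import org.apache.cassandra.distributed.Cluster;

public class RelaxedTimeoutsExample
{
    public static void main(String[] args) throws Throwable
    {
        try (Cluster cluster = Cluster.build(2)
                                      .withConfig(c -> {
                                          c.set("slow_query_log_timeout", "5s");    // assumed key/value
                                          c.set("native_transport_timeout", "30s"); // assumed key/value
                                      })
                                      .start())
        {
            // run the paging-heavy SAI workload here with an intentionally tiny fetch size
        }
    }
}
{noformat}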



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19695) Accord Journal Simulation: Add instrumentation for Semaphore

2024-06-10 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19695:

Test and Documentation Plan: Includes a test
 Status: Patch Available  (was: Open)

> Accord Journal Simulation: Add instrumentation for Semaphore
> 
>
> Key: CASSANDRA-19695
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19695
> Project: Cassandra
>  Issue Type: Bug
>  Components: Accord
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19695) Accord Journal Simulation: Add instrumentation for Semaphore

2024-06-10 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19695:

 Bug Category: Parent values: Code(13163)Level 1 values: Bug - Unclear 
Impact(13164)
   Complexity: Normal
  Component/s: Accord
Discovered By: Code Inspection
 Severity: Low
   Status: Open  (was: Triage Needed)

> Accord Journal Simulation: Add instrumentation for Semaphore
> 
>
> Key: CASSANDRA-19695
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19695
> Project: Cassandra
>  Issue Type: Bug
>  Components: Accord
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-19695) Accord Journal Simulation: Add instrumentation for Semaphore

2024-06-10 Thread Alex Petrov (Jira)
Alex Petrov created CASSANDRA-19695:
---

 Summary: Accord Journal Simulation: Add instrumentation for 
Semaphore
 Key: CASSANDRA-19695
 URL: https://issues.apache.org/jira/browse/CASSANDRA-19695
 Project: Cassandra
  Issue Type: Bug
Reporter: Alex Petrov






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19694) Make Accord timestamps strictly monotonic

2024-06-10 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19694:

Test and Documentation Plan: Covered by existing tests 
 Status: Patch Available  (was: Open)

> Make Accord timestamps strictly monotonic
> -
>
> Key: CASSANDRA-19694
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19694
> Project: Cassandra
>  Issue Type: Bug
>  Components: Accord
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-19694) Make Accord timestamps strictly monotonic

2024-06-10 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov reassigned CASSANDRA-19694:
---

Assignee: Alex Petrov

> Make Accord timestamps strictly monotonic
> -
>
> Key: CASSANDRA-19694
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19694
> Project: Cassandra
>  Issue Type: Bug
>  Components: Accord
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19694) Make Accord timestamps strictly monotonic

2024-06-10 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19694:

 Bug Category: Parent values: Correctness(12982)Level 1 values: 
Unrecoverable Corruption / Loss(13161)
   Complexity: Low Hanging Fruit
Discovered By: Code Inspection
 Severity: Critical
   Status: Open  (was: Triage Needed)

> Make Accord timestamps strictly monotonic
> -
>
> Key: CASSANDRA-19694
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19694
> Project: Cassandra
>  Issue Type: Bug
>  Components: Accord
>Reporter: Alex Petrov
>Priority: Normal
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-19694) Make Accord timestamps strictly monotonic

2024-06-10 Thread Alex Petrov (Jira)
Alex Petrov created CASSANDRA-19694:
---

 Summary: Make Accord timestamps strictly monotonic
 Key: CASSANDRA-19694
 URL: https://issues.apache.org/jira/browse/CASSANDRA-19694
 Project: Cassandra
  Issue Type: Bug
  Components: Accord
Reporter: Alex Petrov






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19662) Data Corruption and OOM Issues During Schema Alterations

2024-06-01 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19662:

Component/s: Cluster/Schema
 (was: Client/java-driver)

> Data Corruption and OOM Issues During Schema Alterations 
> -
>
> Key: CASSANDRA-19662
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19662
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Schema
>Reporter: BHARATH KUMAR
>Priority: Urgent
> Attachments: BufferUnderflow_plus_error
>
>
> h2. Description
>  
> *Overview:* The primary issue is data corruption occurring during schema 
> alterations (ADD/DROP column) on large tables (300+ columns, 6 TB in size) in 
> the production cluster. This is accompanied by out-of-memory (OOM) errors and 
> other exceptions, specifically during batch reads. This problem has been 
> replicated on multiple clusters running Apache Cassandra version 4.0.12 and 
> DataStax Java Driver version 4.17.
> *Details:*
> *Main Issue:*
>  * *Data Corruption:* When dynamically adding a column to a table, the data 
> intended for the new column is shifted, causing misalignment in the data.
>  * *Symptoms:* The object implementing 
> {{com.datastax.oss.driver.api.core.cql.Row}} returns values shifted against 
> the column names returned by {{{}row.getColumnDefinitions(){}}}. The driver 
> returns a corrupted row, leading to incorrect data insertion.
> *Additional Issues:*
> *Exceptions:*
>  * {{java.nio.BufferUnderflowException}} during batch reads when ALTER TABLE 
> ADD/DROP column statements are issued.
>  * {{java.lang.ArrayIndexOutOfBoundsException}} in some cases.
>  * Buffer underflow exceptions with messages like "Invalid 32-bits integer 
> value, expecting 4 bytes but got 292".
>  * OOM errors mostly occur during ADD column operations, while other 
> exceptions occur during DELETE column operations.
>  * *Method Specific:* Errors occur specifically with 
> {{{}row.getList(columnName, Float.class){}}}, returning incorrect values.
> *Reproducibility:*
>  * The issue is reproducible on larger tables (300 columns, 6 TB size) but 
> not on smaller tables.
>  * SELECT * statements are used during reads
>  * *Method Specific:* Errors occur specifically with 
> {{{}row.getList(columnName, Float.class){}}}, returning incorrect values. 
> However, the code registers a driver exception when calling the method 
> {{{}row.getList(columnName, Float.class){}}}. We pass the exact column name 
> obtained from {{{}row.getColumnDefinition{}}}, but it returns the wrong value 
> for a column with this name. This suggests that the issue lies with the 
> driver returning an object with incorrect properties, rather than with the 
> SQL query itself.
> *Debugging Efforts:*
>  * *Metadata Refresh:* Enabling metadata refresh did not resolve the issue.
>  * *Schema Agreement:* {{session.getCqlSession().checkSchemaAgreement()}} did 
> not detect inconsistencies during test execution.
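
As a hedged illustration of the read pattern described above (placeholder keyspace, table, and
column names; not code from the ticket), the failing access looks roughly like this with the
DataStax Java driver 4.x:

{noformat}
// Sketch of the reported access pattern: full-row reads where list columns are fetched via
// getList(columnName, Float.class) while ALTER TABLE ADD/DROP COLUMN runs concurrently.
// "ks.wide_table" and the "float_list_" prefix are placeholders.
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ColumnDefinition;
import com.datastax.oss.driver.api.core.cql.Row;

public class WideTableReadCheck
{
    public static void main(String[] args)
    {
        try (CqlSession session = CqlSession.builder().build())
        {
            for (Row row : session.execute("SELECT * FROM ks.wide_table"))
            {
                for (ColumnDefinition def : row.getColumnDefinitions())
                {
                    String name = def.getName().asInternal();
                    // The ticket reports values shifted against these definitions, e.g.
                    // getList(name, Float.class) returning data that belongs to another column.
                    if (name.startsWith("float_list_"))
                        System.out.println(name + " -> " + row.getList(name, Float.class));
                }
            }
            System.out.println("schema agreement: " + session.checkSchemaAgreement());
        }
    }
}
{noformat}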



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability

2024-05-31 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19664:

  Fix Version/s: 5.1-alpha1
  Since Version: 5.1-alpha1
Source Control Link: 
https://github.com/apache/cassandra/commit/b0ca509e7add760d187fcc5a9908d93d7c4fd6ec
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

> Accord Journal Determinism: PreAccept replay stability 
> ---
>
> Key: CASSANDRA-19664
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
> Project: Cassandra
>  Issue Type: Bug
>  Components: Accord
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-1.html, ci_summary.html
>
>
> Currently, some messages, such as PreAccept, can have some of their context 
> initialized on replay. This patch adds a concept of Context to Journal that 
> can be used for arbitrary information necessary for replaying them just the 
> way they were executed the first time.
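
The description is terse, so here is a toy sketch of the general idea (illustrative only, not
the patch's actual Journal or Context API): the journal stores, next to each message, whatever
context was derived when it first executed, and replay feeds that context back instead of
re-deriving it.

{noformat}
// Toy illustration of "record the context alongside the journaled message" so that replay
// is deterministic. Names and shapes are invented for this sketch.
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

public class JournalContextSketch
{
    // A journaled message plus the context captured when it was first executed.
    record Entry(String message, String context) {}

    private final List<Entry> entries = new ArrayList<>();

    // Original execution path: persist the message together with whatever context it derived
    // (e.g. state a PreAccept initialized lazily), so replay need not re-derive it.
    public void append(String message, String context)
    {
        entries.add(new Entry(message, context));
    }

    // Replay path: hand back exactly the recorded context so the message is executed
    // the same way it was the first time.
    public void replay(BiConsumer<String, String> executor)
    {
        for (Entry e : entries)
            executor.accept(e.message(), e.context());
    }

    public static void main(String[] args)
    {
        JournalContextSketch journal = new JournalContextSketch();
        journal.append("PreAccept(txn1)", "executeAt=42");
        journal.replay((msg, ctx) -> System.out.println("replaying " + msg + " with " + ctx));
    }
}
{noformat}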



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability

2024-05-31 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19664:

Status: Ready to Commit  (was: Review In Progress)

Based on Aleksey's +1 on both patches, merging.

> Accord Journal Determinism: PreAccept replay stability 
> ---
>
> Key: CASSANDRA-19664
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
> Project: Cassandra
>  Issue Type: Bug
>  Components: Accord
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary-1.html, ci_summary.html
>
>
> Currently, some messages, such as PreAccept, can have some of their context 
> initialized on replay. This patch adds a concept of Context to Journal that 
> can be used for arbitrary information necessary for replaying them just the 
> way they were executed the first time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability

2024-05-31 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850974#comment-17850974
 ] 

Alex Petrov edited comment on CASSANDRA-19664 at 5/31/24 9:39 AM:
--

[~aleksey] uploaded the latest CI run; there are some JDK17 failures that seem 
to be related to {{add-opens}}; three dtest failures are unrelated. 


was (Author: ifesdjeen):
[~aleksey] uploaded the latest CI run; there are some JDK17 failures that seem 
to be related to {add-opens}; three dtest failures are unrelated. 

> Accord Journal Determinism: PreAccept replay stability 
> ---
>
> Key: CASSANDRA-19664
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
> Project: Cassandra
>  Issue Type: Bug
>  Components: Accord
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary-1.html, ci_summary.html
>
>
> Currently, some messages, such as PreAccept, can have some of their context 
> initialized on replay. This patch adds a concept of Context to Journal that 
> can be used for arbitrary information necessary for replaying them just the 
> way they were executed the first time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

2024-05-31 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19215:

Status: Open  (was: Patch Available)

> "Query start time" in native transport request threads should be the task 
> enqueue time
> --
>
> Key: CASSANDRA-19215
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client
>Reporter: Runtian Liu
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: ci_summary.html, result_details.tar.gz
>
>
> Recently, our Cassandra 4.0.6 cluster experienced an outage due to a surge in 
> expensive traffic from the application side. This surge involved a large 
> volume of costly read queries, which took a considerable amount of time to 
> process on the server side. The client had timeout settings; if a request 
> timed out, it might trigger the sending of new requests. Since the server 
> nodes were overloaded, numerous nodes had hundreds of thousands of tasks 
> queued in the Native-Transport-Request pending queue. I expected that once 
> the application ceased sending requests, the server node would quickly return 
> to normal, as most requests in the queue were over half an hour old and 
> should have timed out rapidly, clearing the queue. However, it actually took 
> an hour to clear the native transport's pending queue, even with native 
> transport disabled. Upon examining the code, I noticed that for read/write 
> requests, the 
> [queryStartNanoTime|https://github.com/apache/cassandra/blob/cassandra-4.0/src/java/org/apache/cassandra/transport/Dispatcher.java#L78],
>  which determines if a request has timed out, only begins when the task 
> starts processing. This means that no matter how long a request has been 
> pending, it doesn't contribute to the timeout. I believe this is incorrect. 
> The timer should start when the Cassandra server receives the request or when 
> it enqueues the task, not when the request/task begins processing. This way, 
> an overloaded node with many pending tasks can quickly discard timed-out 
> requests and recover from an outage once new requests stop.
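
A minimal sketch of the behaviour the reporter is asking for (not Cassandra's actual
Dispatcher code; the names here are invented): stamp each request with its enqueue time and
let workers shed it once the client-visible timeout has already elapsed, rather than starting
the clock only when processing begins.

{noformat}
// Illustrative only: "query start" is taken at enqueue time, so a long-queued request can be
// discarded cheaply instead of being processed after the client has already given up.
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class EnqueueTimeTimeoutSketch
{
    record QueuedRequest(Runnable work, long enqueuedAtNanos) {}

    static final long TIMEOUT_NANOS = TimeUnit.SECONDS.toNanos(10);
    static final LinkedBlockingQueue<QueuedRequest> QUEUE = new LinkedBlockingQueue<>();

    static void submit(Runnable work)
    {
        QUEUE.add(new QueuedRequest(work, System.nanoTime())); // start the clock on enqueue
    }

    static void workerLoop() throws InterruptedException
    {
        while (true)
        {
            QueuedRequest req = QUEUE.take();
            if (System.nanoTime() - req.enqueuedAtNanos() > TIMEOUT_NANOS)
                continue; // already timed out from the client's perspective: shed it
            req.work().run();
        }
    }
}
{noformat}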



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

2024-05-31 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19215:

Resolution: Fixed
Status: Resolved  (was: Open)

> "Query start time" in native transport request threads should be the task 
> enqueue time
> --
>
> Key: CASSANDRA-19215
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client
>Reporter: Runtian Liu
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: ci_summary.html, result_details.tar.gz
>
>
> Recently, our Cassandra 4.0.6 cluster experienced an outage due to a surge in 
> expensive traffic from the application side. This surge involved a large 
> volume of costly read queries, which took a considerable amount of time to 
> process on the server side. The client had timeout settings; if a request 
> timed out, it might trigger the sending of new requests. Since the server 
> nodes were overloaded, numerous nodes had hundreds of thousands of tasks 
> queued in the Native-Transport-Request pending queue. I expected that once 
> the application ceased sending requests, the server node would quickly return 
> to normal, as most requests in the queue were over half an hour old and 
> should have timed out rapidly, clearing the queue. However, it actually took 
> an hour to clear the native transport's pending queue, even with native 
> transport disabled. Upon examining the code, I noticed that for read/write 
> requests, the 
> [queryStartNanoTime|https://github.com/apache/cassandra/blob/cassandra-4.0/src/java/org/apache/cassandra/transport/Dispatcher.java#L78],
>  which determines if a request has timed out, only begins when the task 
> starts processing. This means that no matter how long a request has been 
> pending, it doesn't contribute to the timeout. I believe this is incorrect. 
> The timer should start when the Cassandra server receives the request or when 
> it enqueues the task, not when the request/task begins processing. This way, 
> an overloaded node with many pending tasks can quickly discard timed-out 
> requests and recover from an outage once new requests stop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

2024-05-31 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov reassigned CASSANDRA-19215:
---

Assignee: Alex Petrov

> "Query start time" in native transport request threads should be the task 
> enqueue time
> --
>
> Key: CASSANDRA-19215
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client
>Reporter: Runtian Liu
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: ci_summary.html, result_details.tar.gz
>
>
> Recently, our Cassandra 4.0.6 cluster experienced an outage due to a surge in 
> expensive traffic from the application side. This surge involved a large 
> volume of costly read queries, which took a considerable amount of time to 
> process on the server side. The client had timeout settings; if a request 
> timed out, it might trigger the sending of new requests. Since the server 
> nodes were overloaded, numerous nodes had hundreds of thousands of tasks 
> queued in the Native-Transport-Request pending queue. I expected that once 
> the application ceased sending requests, the server node would quickly return 
> to normal, as most requests in the queue were over half an hour old and 
> should have timed out rapidly, clearing the queue. However, it actually took 
> an hour to clear the native transport's pending queue, even with native 
> transport disabled. Upon examining the code, I noticed that for read/write 
> requests, the 
> [queryStartNanoTime|https://github.com/apache/cassandra/blob/cassandra-4.0/src/java/org/apache/cassandra/transport/Dispatcher.java#L78],
>  which determines if a request has timed out, only begins when the task 
> starts processing. This means that no matter how long a request has been 
> pending, it doesn't contribute to the timeout. I believe this is incorrect. 
> The timer should start when the Cassandra server receives the request or when 
> it enqueues the task, not when the request/task begins processing. This way, 
> an overloaded node with many pending tasks can quickly discard timed-out 
> requests and recover from an outage once new requests stop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

2024-05-31 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851018#comment-17851018
 ] 

Alex Petrov commented on CASSANDRA-19215:
-

This should be fixed by [CASSANDRA-19534].

> "Query start time" in native transport request threads should be the task 
> enqueue time
> --
>
> Key: CASSANDRA-19215
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client
>Reporter: Runtian Liu
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: ci_summary.html, result_details.tar.gz
>
>
> Recently, our Cassandra 4.0.6 cluster experienced an outage due to a surge in 
> expensive traffic from the application side. This surge involved a large 
> volume of costly read queries, which took a considerable amount of time to 
> process on the server side. The client had timeout settings; if a request 
> timed out, it might trigger the sending of new requests. Since the server 
> nodes were overloaded, numerous nodes had hundreds of thousands of tasks 
> queued in the Native-Transport-Request pending queue. I expected that once 
> the application ceased sending requests, the server node would quickly return 
> to normal, as most requests in the queue were over half an hour old and 
> should have timed out rapidly, clearing the queue. However, it actually took 
> an hour to clear the native transport's pending queue, even with native 
> transport disabled. Upon examining the code, I noticed that for read/write 
> requests, the 
> [queryStartNanoTime|https://github.com/apache/cassandra/blob/cassandra-4.0/src/java/org/apache/cassandra/transport/Dispatcher.java#L78],
>  which determines if a request has timed out, only begins when the task 
> starts processing. This means that no matter how long a request has been 
> pending, it doesn't contribute to the timeout. I believe this is incorrect. 
> The timer should start when the Cassandra server receives the request or when 
> it enqueues the task, not when the request/task begins processing. This way, 
> an overloaded node with many pending tasks can quickly discard timed-out 
> requests and recover from an outage once new requests stop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19534) Unbounded queues in native transport requests lead to node instability

2024-05-31 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19534:

Since Version: 3.0.0  (was: 4.1.5)

> Unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 4.1.x, 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary-4.1.html, 
> ci_summary-5.0.html, ci_summary-trunk.html, ci_summary.html, 
> image-2024-05-03-16-08-10-101.png, screenshot-1.png, screenshot-2.png, 
> screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png, 
> screenshot-7.png, screenshot-8.png, screenshot-9.png
>
>  Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> time out than is configured.  We should be shedding load much more 
> aggressively and using a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource-consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.
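
For readers skimming the thread, a hedged sketch of the "bounded queue + shed load" idea the
description argues for (illustrative only, not the committed patch):

{noformat}
// Illustrative only: reject new work once the queue is full (the caller would translate the
// rejection into an Overloaded response) instead of letting the backlog grow without bound.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedRequestQueueSketch
{
    private final BlockingQueue<Runnable> queue;

    public BoundedRequestQueueSketch(int capacity)
    {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Returns false when the queue is full; the transport layer would shed the request
    // immediately rather than queueing it behind hundreds of thousands of others.
    public boolean tryEnqueue(Runnable request)
    {
        return queue.offer(request);
    }

    public Runnable next() throws InterruptedException
    {
        return queue.take();
    }
}
{noformat}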



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19534) Unbounded queues in native transport requests lead to node instability

2024-05-31 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19534:

  Since Version: 4.1.5
Source Control Link: 
https://github.com/apache/cassandra/commit/dc17c29724d86547538cc8116ff1a90d36a0bf3a
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

Committed to 4.1 with 
[dc17c29724d86547538cc8116ff1a90d36a0bf3a|https://github.com/apache/cassandra/commit/dc17c29724d86547538cc8116ff1a90d36a0bf3a]
 and merged up to 
[5.0|https://github.com/apache/cassandra/commit/617a75843c9bfaf241249514f9604466f6c8ccab]
 and 
[trunk|https://github.com/apache/cassandra/commit/d10008d54bfb301ba12d022037b1caf78f18418b].

> Unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 4.1.x, 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary-4.1.html, 
> ci_summary-5.0.html, ci_summary-trunk.html, ci_summary.html, 
> image-2024-05-03-16-08-10-101.png, screenshot-1.png, screenshot-2.png, 
> screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png, 
> screenshot-7.png, screenshot-8.png, screenshot-9.png
>
>  Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> time out than is configured.  We should be shedding load much more 
> aggressively and using a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource-consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19534) Unbounded queues in native transport requests lead to node instability

2024-05-31 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19534:

Status: Ready to Commit  (was: Review In Progress)

> Unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 4.1.x, 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary-4.1.html, 
> ci_summary-5.0.html, ci_summary-trunk.html, ci_summary.html, 
> image-2024-05-03-16-08-10-101.png, screenshot-1.png, screenshot-2.png, 
> screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png, 
> screenshot-7.png, screenshot-8.png, screenshot-9.png
>
>  Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> time out than is configured.  We should be shedding load much more 
> aggressively and using a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource-consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability

2024-05-31 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19664:

Attachment: ci_summary-1.html

> Accord Journal Determinism: PreAccept replay stability 
> ---
>
> Key: CASSANDRA-19664
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
> Project: Cassandra
>  Issue Type: Bug
>  Components: Accord
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary-1.html, ci_summary.html
>
>
> Currently, some messages, such as PreAccept, can have some of their context 
> initialized on replay. This patch adds a concept of Context to Journal that 
> can be used for arbitrary information necessary for replaying them just the 
> way they were executed the first time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability

2024-05-31 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850974#comment-17850974
 ] 

Alex Petrov commented on CASSANDRA-19664:
-

[~aleksey] uploaded the latest CI run; there are some JDK17 failures that seem 
to be related to {add-opens}; three dtest failures are unrelated. 

> Accord Journal Determinism: PreAccept replay stability 
> ---
>
> Key: CASSANDRA-19664
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
> Project: Cassandra
>  Issue Type: Bug
>  Components: Accord
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary-1.html, ci_summary.html
>
>
> Currently, some messages, such as PreAccept, can have some of their context 
> initialized on replay. This patch adds a concept of Context to Journal that 
> can be used for arbitrary information necessary for replaying them just the 
> way they were executed the first time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19534) Unbounded queues in native transport requests lead to node instability

2024-05-31 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19534:

Summary: Unbounded queues in native transport requests lead to node 
instability  (was: unbounded queues in native transport requests lead to node 
instability)

> Unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 4.1.x, 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary-4.1.html, 
> ci_summary-5.0.html, ci_summary-trunk.html, ci_summary.html, 
> image-2024-05-03-16-08-10-101.png, screenshot-1.png, screenshot-2.png, 
> screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png, 
> screenshot-7.png, screenshot-8.png, screenshot-9.png
>
>  Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> time out than is configured.  We should be shedding load much more 
> aggressively and using a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource-consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability

2024-05-30 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19664:

Attachment: ci_summary.html

> Accord Journal Determinism: PreAccept replay stability 
> ---
>
> Key: CASSANDRA-19664
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
> Project: Cassandra
>  Issue Type: Bug
>  Components: Accord
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary.html
>
>
> Currently, some messages, such as PreAccept, can have some of their context 
> initialized on replay. This patch adds a concept of Context to Journal that 
> can be used for arbitrary information necessary for replaying them just the 
> way they were executed the first time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability

2024-05-29 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19664:

Reviewers: Aleksey Yeschenko, Alex Petrov
   Status: Review In Progress  (was: Patch Available)

> Accord Journal Determinism: PreAccept replay stability 
> ---
>
> Key: CASSANDRA-19664
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
> Project: Cassandra
>  Issue Type: Bug
>  Components: Accord
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
>
> Currently, some messages, such as PreAccept, can have some of their context 
> initialized on replay. This patch adds a concept of Context to Journal that 
> can be used for arbitrary information necessary for replaying them just the 
> way they were executed the first time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability

2024-05-29 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19664:

Test and Documentation Plan: Covered by existing tests in part; more tests 
coming with a follow-up patch
 Status: Patch Available  (was: Open)

> Accord Journal Determinism: PreAccept replay stability 
> ---
>
> Key: CASSANDRA-19664
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
> Project: Cassandra
>  Issue Type: Bug
>  Components: Accord
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
>
> Currently, some messages, such as PreAccept, can have some of their context 
> initialized on replay. This patch adds a concept of Context to Journal that 
> can be used for arbitrary information necessary for replaying them just the 
> way they were executed the first time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability

2024-05-29 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19664:

 Bug Category: Parent values: Correctness(12982)Level 1 values: 
Unrecoverable Corruption / Loss(13161)
   Complexity: Normal
  Component/s: Accord
Discovered By: Code Inspection
 Severity: Critical
   Status: Open  (was: Triage Needed)

> Accord Journal Determinism: PreAccept replay stability 
> ---
>
> Key: CASSANDRA-19664
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
> Project: Cassandra
>  Issue Type: Bug
>  Components: Accord
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
>
> Currently, some messages, such as PreAccept, can have some of their context 
> initialized on replay. This patch adds a concept of Context to Journal that 
> can be used for arbitrary information necessary for replaying them just the 
> way they were executed the first time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19662) Data Corruption and OOM Issues During Schema Alterations

2024-05-28 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850061#comment-17850061
 ] 

Alex Petrov commented on CASSANDRA-19662:
-

[~kumarbharath] which Cassandra version are you using? 

> Data Corruption and OOM Issues During Schema Alterations 
> -
>
> Key: CASSANDRA-19662
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19662
> Project: Cassandra
>  Issue Type: Bug
>  Components: Client/java-driver
>Reporter: BHARATH KUMAR
>Priority: Urgent
> Attachments: BufferUnderflow_plus_error
>
>
> h2. Description
>  
> *Overview:* The primary issue is data corruption occurring during schema 
> alterations (ADD/DROP column) on large tables (300+ columns, 6 TB in size) in 
> the production cluster. This is accompanied by out-of-memory (OOM) errors and 
> other exceptions, specifically during batch reads. This problem has been 
> replicated on multiple clusters running Apache Cassandra version 4.0.12 and 
> DataStax Java Driver version 4.17.
> *Details:*
> *Main Issue:*
>  * *Data Corruption:* When dynamically adding a column to a table, the data 
> intended for the new column is shifted, causing misalignment in the data.
>  * *Symptoms:* The object implementing 
> {{com.datastax.oss.driver.api.core.cql.Row}} returns values shifted against 
> the column names returned by {{{}row.getColumnDefinitions(){}}}. The driver 
> returns a corrupted row, leading to incorrect data insertion.
> *Additional Issues:*
> *Exceptions:*
>  * {{java.nio.BufferUnderflowException}} during batch reads when ALTER TABLE 
> ADD/DROP column statements are issued.
>  * {{java.lang.ArrayIndexOutOfBoundsException}} in some cases.
>  * Buffer underflow exceptions with messages like "Invalid 32-bits integer 
> value, expecting 4 bytes but got 292".
>  * OOM errors mostly occur during ADD column operations, while other 
> exceptions occur during DELETE column operations.
>  * *Method Specific:* Errors occur specifically with 
> {{{}row.getList(columnName, Float.class){}}}, returning incorrect values.
> *Reproducibility:*
>  * The issue is reproducible on larger tables (300 columns, 6 TB size) but 
> not on smaller tables.
>  * SELECT * statements are used during reads
>  * *Method Specific:* Errors occur specifically with 
> {{{}row.getList(columnName, Float.class){}}}, returning incorrect values. 
> However, the code registers a driver exception when calling the method 
> {{{}row.getList(columnName, Float.class){}}}. We pass the exact column name 
> obtained from {{{}row.getColumnDefinition{}}}, but it returns the wrong value 
> for a column with this name. This suggests that the issue lies with the 
> driver returning an object with incorrect properties, rather than with the 
> SQL query itself.
> *Debugging Efforts:*
>  * *Metadata Refresh:* Enabling metadata refresh did not resolve the issue.
>  * *Schema Agreement:* {{session.getCqlSession().checkSchemaAgreement()}} did 
> not detect inconsistencies during test execution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19663) trunk fails to start

2024-05-28 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850060#comment-17850060
 ] 

Alex Petrov commented on CASSANDRA-19663:
-

> Is something else needed for 5.1?

There should not be anything different needed for trunk. It also seems to build 
on CI as recently as today.

> trunk fails to start
> 
>
> Key: CASSANDRA-19663
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19663
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Jon Haddad
>Priority: Normal
>
> On commit {{6701259bce91672a7c3ca9fb77ea7b040e9c}}, I get errors on 
> startup.
> Verified the build was successful:
> {noformat}
> easy-cass-lab.amazon-ebs.ubuntu: BUILD SUCCESSFUL
> easy-cass-lab.amazon-ebs.ubuntu: Total time: 1 minute 41 seconds
> {noformat}
> Running on a new Ubuntu instance:
> {noformat}
> INFO  [main] 2024-05-24 18:31:16,397 YamlConfigurationLoader.java:103 - 
> Configuration location: file:/usr/local/cassandra/trunk/conf/cassandra.yaml
> ERROR [main] 2024-05-24 18:31:16,470 CassandraDaemon.java:900 - Exception 
> encountered during startup
> java.lang.NoSuchMethodError: 'void 
> org.yaml.snakeyaml.LoaderOptions.setCodePointLimit(int)'
>   at 
> org.apache.cassandra.config.YamlConfigurationLoader.getDefaultLoaderOptions(YamlConfigurationLoader.java:433)
>   at 
> org.apache.cassandra.config.YamlConfigurationLoader$CustomConstructor.<init>(YamlConfigurationLoader.java:278)
>   at 
> org.apache.cassandra.config.YamlConfigurationLoader.loadConfig(YamlConfigurationLoader.java:135)
>   at 
> org.apache.cassandra.config.YamlConfigurationLoader.loadConfig(YamlConfigurationLoader.java:116)
>   at 
> org.apache.cassandra.config.DatabaseDescriptor.loadConfig(DatabaseDescriptor.java:403)
>   at 
> org.apache.cassandra.config.DatabaseDescriptor.daemonInitialization(DatabaseDescriptor.java:265)
>   at 
> org.apache.cassandra.config.DatabaseDescriptor.daemonInitialization(DatabaseDescriptor.java:250)
>   at 
> org.apache.cassandra.service.CassandraDaemon.applyConfig(CassandraDaemon.java:781)
>   at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:724)
>   at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:878)
> {noformat}
> Running on Java 17:
> {noformat}
> ubuntu@cassandra0:~$ java -version
> openjdk version "17.0.10" 2024-01-16
> OpenJDK Runtime Environment (build 17.0.10+7-Ubuntu-122.04.1)
> OpenJDK 64-Bit Server VM (build 17.0.10+7-Ubuntu-122.04.1, mixed mode, 
> sharing)
> {noformat}
> Built with 11.
> The only configs I changed:
> {noformat}
> cluster_name: "system_views"
> num_tokens: 4
> seed_provider:
>   class_name: "org.apache.cassandra.locator.SimpleSeedProvider"
>   parameters:
> seeds: "10.0.0.225"
> hints_directory: "/mnt/cassandra/hints"
> data_file_directories:
> - "/mnt/cassandra/data"
> commitlog_directory: "/mnt/cassandra/commitlog"
> concurrent_reads: 64
> concurrent_writes: 64
> trickle_fsync: true
> endpoint_snitch: "Ec2Snitch"
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce

2024-05-28 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19158:

Status: Ready to Commit  (was: Changes Suggested)

> Reuse native transport-driven futures in Debounce
> -
>
> Key: CASSANDRA-19158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19158
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary-1.html, ci_summary-2.html, ci_summary.html
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently, we create a future in Debounce, then create one more future in 
> RemoteProcessor#sendWithCallback. This is further exacerbated by chaining 
> calls, when we first attempt to catch up from peer, and then from CMS.
> First of all, we should only ever use the native transport timeout-driven 
> futures returned from sendWithCallback, since they implement reasonable 
> retries under the hood and are easy to bulk-configure (i.e. you can simply 
> change the timeout in yaml and have all futures change their behaviour).
> Second, we should _chain_ futures and use map or andThen for fallback 
> operations such as trying to catch up from the CMS after an unsuccessful 
> attempt to catch up from a peer.
> This should significantly simplify the code and reduce the number of 
> blocked/waiting threads.
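
The chaining shape described above, sketched with plain CompletableFuture purely for
illustration (the patch itself works with Cassandra's own future types returned by
sendWithCallback):

{noformat}
// Illustrative only: try the peer first and, only on failure, fall back to catching up from
// the CMS, without parking a thread in between the two attempts.
import java.util.concurrent.CompletableFuture;

public class CatchUpChainSketch
{
    static CompletableFuture<String> catchUpFromPeer()
    {
        return CompletableFuture.failedFuture(new RuntimeException("peer unavailable"));
    }

    static CompletableFuture<String> catchUpFromCms()
    {
        return CompletableFuture.completedFuture("epoch from CMS");
    }

    public static void main(String[] args)
    {
        CompletableFuture<String> result =
            catchUpFromPeer().exceptionallyCompose(ignored -> catchUpFromCms());
        System.out.println(result.join()); // prints "epoch from CMS"
    }
}
{noformat}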



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce

2024-05-28 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19158:

  Fix Version/s: 5.1-alpha1
Source Control Link: 
https://github.com/apache/cassandra/commit/2e05cd4c8dd22e458eb1d2dad9cd34936b470266
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

> Reuse native transport-driven futures in Debounce
> -
>
> Key: CASSANDRA-19158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19158
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-1.html, ci_summary-2.html, ci_summary.html
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently, we create a future in Debounce, then create one more future in 
> RemoteProcessor#sendWithCallback. This is further exacerbated by chaining 
> calls, when we first attempt to catch up from peer, and then from CMS.
> First of all, we should only ever use the native transport timeout-driven 
> futures returned from sendWithCallback, since they implement reasonable 
> retries under the hood and are easy to bulk-configure (i.e. you can simply 
> change the timeout in yaml and have all futures change their behaviour).
> Second, we should _chain_ futures and use map or andThen for fallback 
> operations such as trying to catch up from the CMS after an unsuccessful 
> attempt to catch up from a peer.
> This should significantly simplify the code and reduce the number of 
> blocked/waiting threads.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-05-28 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19534:

Attachment: ci_summary-4.1.html

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 4.1.x, 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary-4.1.html, 
> ci_summary-5.0.html, ci_summary-trunk.html, ci_summary.html, 
> image-2024-05-03-16-08-10-101.png, screenshot-1.png, screenshot-2.png, 
> screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png, 
> screenshot-7.png, screenshot-8.png, screenshot-9.png
>
>  Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.
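A minimal sketch of the bounded-queue direction suggested here, assuming nothing about Cassandra's actual Dispatcher/NTR internals (all names are hypothetical):

{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public final class BoundedNtrQueueSketch
{
    // Bounded queue: the backlog cannot grow to hundreds of thousands of requests.
    static final BlockingQueue<Runnable> QUEUE = new ArrayBlockingQueue<>(1024);

    // Admission: shed load immediately when the queue is full instead of accepting
    // the request; the caller can then fail fast (e.g. reply with an overloaded
    // error) rather than let work pile up.
    static boolean submit(Runnable work)
    {
        return QUEUE.offer(work);
    }

    // Worker side: drain and execute. Because the queue is bounded, the worst-case
    // time anything spends waiting in it is also bounded.
    static void workerLoop() throws InterruptedException
    {
        while (true)
            QUEUE.take().run();
    }
}
{code}

In this sketch the rejection is only signalled by the boolean return value; what the server does with that rejection is the backpressure question discussed on this ticket.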



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce

2024-05-28 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849920#comment-17849920
 ] 

Alex Petrov edited comment on CASSANDRA-19158 at 5/28/24 9:09 AM:
--

[~samt] I think I have addressed all your comments, and got a CI run with one 
unrelated failure. Could you take another look?



was (Author: ifesdjeen):
[~samt] I think I have addressed all your comments, and got a clean CI now. 
Could you take another look?


> Reuse native transport-driven futures in Debounce
> -
>
> Key: CASSANDRA-19158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19158
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary-1.html, ci_summary-2.html, ci_summary.html
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently, we create a future in Debounce, then create one more future in 
> RemoteProcessor#sendWithCallback. This is further exacerbated by chaining 
> calls, when we first attempt to catch up from peer, and then from CMS.
> First of all, we should only use the native transport timeout-driven 
> futures returned from sendWithCallback, since they implement reasonable 
> retries under the hood and are easy to bulk-configure (i.e. you can simply 
> change the timeout in yaml and have all futures change their behaviour).
> Second, we should _chain_ futures and use map or andThen for fallback 
> operations such as trying to catch up from CMS after an unsuccessful attempt to 
> catch up from peer.
> This should significantly simplify the code and number of blocked/waiting 
> threads.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce

2024-05-28 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19158:

Attachment: ci_summary-2.html

> Reuse native transport-driven futures in Debounce
> -
>
> Key: CASSANDRA-19158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19158
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary-1.html, ci_summary-2.html, ci_summary.html
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently, we create a future in Debounce, then create one more future in 
> RemoteProcessor#sendWithCallback. This is further exacerbated by chaining 
> calls, when we first attempt to catch up from peer, and then from CMS.
> First of all, we should only use the native transport timeout-driven 
> futures returned from sendWithCallback, since they implement reasonable 
> retries under the hood and are easy to bulk-configure (i.e. you can simply 
> change the timeout in yaml and have all futures change their behaviour).
> Second, we should _chain_ futures and use map or andThen for fallback 
> operations such as trying to catch up from CMS after an unsuccessful attempt to 
> catch up from peer.
> This should significantly simplify the code and number of blocked/waiting 
> threads.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce

2024-05-28 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849920#comment-17849920
 ] 

Alex Petrov commented on CASSANDRA-19158:
-

[~samt] I think I have addressed all your comments, and got a clean CI now. 
Could you take another look?


> Reuse native transport-driven futures in Debounce
> -
>
> Key: CASSANDRA-19158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19158
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary-1.html, ci_summary-2.html, ci_summary.html
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently, we create a future in Debounce, then create one more future in 
> RemoteProcessor#sendWithCallback. This is further exacerbated by chaining 
> calls, when we first attempt to catch up from peer, and then from CMS.
> First of all, we should only use the native transport timeout-driven 
> futures returned from sendWithCallback, since they implement reasonable 
> retries under the hood and are easy to bulk-configure (i.e. you can simply 
> change the timeout in yaml and have all futures change their behaviour).
> Second, we should _chain_ futures and use map or andThen for fallback 
> operations such as trying to catch up from CMS after an unsuccessful attempt to 
> catch up from peer.
> This should significantly simplify the code and number of blocked/waiting 
> threads.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-05-27 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849812#comment-17849812
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

[~e.dimitrova] I believe it does. I was just finishing up the trunk and 4.1 
commits, and getting clean CI runs. I think it looks mostly good now.

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 4.1.x, 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary-5.0.html, 
> ci_summary-trunk.html, ci_summary.html, image-2024-05-03-16-08-10-101.png, 
> screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png, 
> screenshot-5.png, screenshot-6.png, screenshot-7.png, screenshot-8.png, 
> screenshot-9.png
>
>  Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-05-27 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19534:

Attachment: ci_summary-trunk.html

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 4.1.x, 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary-5.0.html, 
> ci_summary-trunk.html, ci_summary.html, image-2024-05-03-16-08-10-101.png, 
> screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png, 
> screenshot-5.png, screenshot-6.png, screenshot-7.png, screenshot-8.png, 
> screenshot-9.png
>
>  Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-05-27 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19534:

Attachment: ci_summary-5.0.html

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 4.1.x, 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary-5.0.html, 
> ci_summary.html, image-2024-05-03-16-08-10-101.png, screenshot-1.png, 
> screenshot-2.png, screenshot-3.png, screenshot-4.png, screenshot-5.png, 
> screenshot-6.png, screenshot-7.png, screenshot-8.png, screenshot-9.png
>
>  Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability

2024-05-27 Thread Alex Petrov (Jira)
Alex Petrov created CASSANDRA-19664:
---

 Summary: Accord Journal Determinism: PreAccept replay stability 
 Key: CASSANDRA-19664
 URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
 Project: Cassandra
  Issue Type: Bug
Reporter: Alex Petrov
Assignee: Alex Petrov


Currently, some messages, such as PreAccept, can have some of their context 
initialized on replay. This patch adds a concept of Context to the Journal that can 
carry arbitrary information necessary for replaying such messages just the way 
they were executed the first time.
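A rough sketch of the idea, with entirely hypothetical names (the real Accord journal types are not shown here): the context captured at append time travels with the record, so replay can reproduce the original execution instead of re-deriving state.

{code:java}
import java.util.Map;

// Hypothetical names; only meant to illustrate carrying per-record context
// alongside a journal record for deterministic replay.
interface ReplayContext
{
    Map<String, Object> values(); // arbitrary key/value data captured at first execution
}

interface JournalRecord
{
    byte[] payload();
    ReplayContext context(); // persisted next to the payload at append time

    default void replay(MessageExecutor executor)
    {
        // Replay hands back the original context instead of re-deriving it,
        // so the message executes the same way it did the first time.
        executor.execute(payload(), context());
    }
}

interface MessageExecutor
{
    void execute(byte[] payload, ReplayContext context);
}
{code}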



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19592) Expand CREATE TABLE CQL on a coordinating node before submitting to CMS

2024-05-16 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846953#comment-17846953
 ] 

Alex Petrov commented on CASSANDRA-19592:
-

[~samt] looks good to me!

> Expand CREATE TABLE CQL on a coordinating node before submitting to CMS
> ---
>
> Key: CASSANDRA-19592
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19592
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Schema
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary-1.html, ci_summary.html
>
>
> This is done to unblock CASSANDRA-12937 and allow preserving defaults with 
> which the table was created between node bounces and between nodes with 
> different configurations. For now, we are preserving 5.0 behaviour.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce

2024-05-16 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19158:

Attachment: ci_summary.html

> Reuse native transport-driven futures in Debounce
> -
>
> Key: CASSANDRA-19158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19158
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary-1.html, ci_summary.html
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently, we create a future in Debounce, then create one more future in 
> RemoteProcessor#sendWithCallback. This is further exacerbated by chaining 
> calls, when we first attempt to catch up from peer, and then from CMS.
> First of all, we should only use the native transport timeout-driven 
> futures returned from sendWithCallback, since they implement reasonable 
> retries under the hood and are easy to bulk-configure (i.e. you can simply 
> change the timeout in yaml and have all futures change their behaviour).
> Second, we should _chain_ futures and use map or andThen for fallback 
> operations such as trying to catch up from CMS after an unsuccessful attempt to 
> catch up from peer.
> This should significantly simplify the code and number of blocked/waiting 
> threads.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce

2024-05-16 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19158:

Attachment: (was: ci_summary.html)

> Reuse native transport-driven futures in Debounce
> -
>
> Key: CASSANDRA-19158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19158
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary-1.html
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently, we create a future in Debounce, then create one more future in 
> RemoteProcessor#sendWithCallback. This is further exacerbated by chaining 
> calls, when we first attempt to catch up from peer, and then from CMS.
> First of all, we should only use the native transport timeout-driven 
> futures returned from sendWithCallback, since they implement reasonable 
> retries under the hood and are easy to bulk-configure (i.e. you can simply 
> change the timeout in yaml and have all futures change their behaviour).
> Second, we should _chain_ futures and use map or andThen for fallback 
> operations such as trying to catch up from CMS after an unsuccessful attempt to 
> catch up from peer.
> This should significantly simplify the code and number of blocked/waiting 
> threads.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce

2024-05-14 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19158:

Attachment: ci_summary-1.html

> Reuse native transport-driven futures in Debounce
> -
>
> Key: CASSANDRA-19158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19158
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary-1.html, ci_summary.html
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently, we create a future in Debounce, then create one more future in 
> RemoteProcessor#sendWithCallback. This is further exacerbated by chaining 
> calls, when we first attempt to catch up from peer, and then from CMS.
> First of all, we should only use the native transport timeout-driven 
> futures returned from sendWithCallback, since they implement reasonable 
> retries under the hood and are easy to bulk-configure (i.e. you can simply 
> change the timeout in yaml and have all futures change their behaviour).
> Second, we should _chain_ futures and use map or andThen for fallback 
> operations such as trying to catch up from CMS after an unsuccessful attempt to 
> catch up from peer.
> This should significantly simplify the code and number of blocked/waiting 
> threads.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-19134) Avoid flushing on every append in the LocalLog

2024-05-13 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov reassigned CASSANDRA-19134:
---

Assignee: Aleksey Yeschenko  (was: Alex Petrov)

> Avoid flushing on every append in the LocalLog
> --
>
> Key: CASSANDRA-19134
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19134
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Cluster/Membership
>Reporter: Marcus Eriksson
>Assignee: Aleksey Yeschenko
>Priority: Normal
> Fix For: 5.1-alpha1
>
>
> Right now, we are performing flush on every transformation that is appended 
> to the local log. While this does make _some_ sense, it may not be what we 
> always want to do. We have initially added this flush as a way to remedy node 
> bounces following schema changes, but this should no longer be necessary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19592) Expand CREATE TABLE CQL on a coordinating node before submitting to CMS

2024-05-13 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845935#comment-17845935
 ] 

Alex Petrov commented on CASSANDRA-19592:
-

Updated the patch with comments from Sam, Marcus, and Stefan

> Expand CREATE TABLE CQL on a coordinating node before submitting to CMS
> ---
>
> Key: CASSANDRA-19592
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19592
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Schema
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary.html
>
>
> This is done to unblock CASSANDRA-12937 and allow preserving defaults with 
> which the table was created between node bounces and between nodes with 
> different configurations. For now, we are preserving 5.0 behaviour.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-05-13 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845930#comment-17845930
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

Pushed a new commit that should address your comments [~maedhroz]

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 4.1.x, 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html, 
> image-2024-05-03-16-08-10-101.png, screenshot-1.png, screenshot-2.png, 
> screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png, 
> screenshot-7.png, screenshot-8.png, screenshot-9.png
>
>  Time Spent: 9h
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-05-13 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845930#comment-17845930
 ] 

Alex Petrov edited comment on CASSANDRA-19534 at 5/13/24 1:47 PM:
--

[~maedhroz] thank you for the review! 
Pushed a new commit that should address your comments.


was (Author: ifesdjeen):
Pushed a new commit that should address your comments [~maedhroz]

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 4.1.x, 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html, 
> image-2024-05-03-16-08-10-101.png, screenshot-1.png, screenshot-2.png, 
> screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png, 
> screenshot-7.png, screenshot-8.png, screenshot-9.png
>
>  Time Spent: 9h
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-17354) Bump java-driver dependency in Cassandra to latest 3.x series

2024-05-08 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-17354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-17354:

Resolution: Won't Fix
Status: Resolved  (was: Open)

As per Abe's message

> Bump java-driver dependency in Cassandra to latest 3.x series 
> --
>
> Key: CASSANDRA-17354
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17354
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/unit
>Reporter: Alex Petrov
>Priority: High
> Fix For: 5.x
>
>
> We depend on java-driver for testing and for developing/validating native 
> protocol changes. Unfortunately, the version of the driver included with 
> Cassandra is quite ancient: 3.0.1. We need to bump this dependency to the latest 
> in the 3.x series, without upgrading to 4.0, at least for now. Unfortunately, this 
> is not a trivial change in build.xml (otherwise I would’ve done it rather 
> than opening this ticket), and bumping the version breaks a few tests in all 
> versions, so those need to be fixed, too. 
> This should be a prerequisite for the next minor version release, too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-16135) Separate in-JVM test into smaller packages

2024-05-07 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov reassigned CASSANDRA-16135:
---

Assignee: (was: Alex Petrov)

> Separate in-JVM test into smaller packages
> --
>
> Key: CASSANDRA-16135
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16135
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/java
>Reporter: Alex Petrov
>Priority: High
> Fix For: 2.2.x, 3.0.x, 3.11.x, 4.0.x
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Introduce a structure similar to how tags are organised in Cassandra Jira for 
> corresponding in-jvm dtests, to help people find the right place for their tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

2024-05-07 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov reassigned CASSANDRA-19215:
---

Assignee: (was: Alex Petrov)

> "Query start time" in native transport request threads should be the task 
> enqueue time
> --
>
> Key: CASSANDRA-19215
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client
>Reporter: Runtian Liu
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: ci_summary.html, result_details.tar.gz
>
>
> Recently, our Cassandra 4.0.6 cluster experienced an outage due to a surge in 
> expensive traffic from the application side. This surge involved a large 
> volume of costly read queries, which took a considerable amount of time to 
> process on the server side. The client had timeout settings; if a request 
> timed out, it might trigger the sending of new requests. Since the server 
> nodes were overloaded, numerous nodes had hundreds of thousands of tasks 
> queued in the Native-Transport-Request pending queue. I expected that once 
> the application ceased sending requests, the server node would quickly return 
> to normal, as most requests in the queue were over half an hour old and 
> should have timed out rapidly, clearing the queue. However, it actually took 
> an hour to clear the native transport's pending queue, even with native 
> transport disabled. Upon examining the code, I noticed that for read/write 
> requests, the 
> [queryStartNanoTime|https://github.com/apache/cassandra/blob/cassandra-4.0/src/java/org/apache/cassandra/transport/Dispatcher.java#L78],
>  which determines if a request has timed out, only begins when the task 
> starts processing. This means that no matter how long a request has been 
> pending, it doesn't contribute to the timeout. I believe this is incorrect. 
> The timer should start when the Cassandra server receives the request or when 
> it enqueues the task, not when the request/task begins processing. This way, 
> an overloaded node with many pending tasks can quickly discard timed-out 
> requests and recover from an outage once new requests stop.
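To make the suggested change concrete, here is a small sketch (hypothetical names, not the actual Dispatcher code): the timeout clock starts when the request is enqueued rather than when the task begins processing, so time spent waiting in the queue counts towards the timeout.

{code:java}
import java.util.concurrent.TimeUnit;

public final class EnqueueTimeTimeoutSketch
{
    static final long READ_TIMEOUT_NANOS = TimeUnit.SECONDS.toNanos(5); // illustrative value

    static final class QueuedRequest
    {
        final long enqueueNanoTime;   // captured when the request is queued, not when it starts running
        final Runnable request;

        QueuedRequest(Runnable request)
        {
            this.enqueueNanoTime = System.nanoTime();
            this.request = request;
        }

        void process()
        {
            // Queueing delay counts towards the timeout, so an overloaded node
            // can discard stale work quickly once clients stop sending requests.
            if (System.nanoTime() - enqueueNanoTime > READ_TIMEOUT_NANOS)
            {
                System.out.println("dropping request that timed out while queued");
                return;
            }
            request.run();
        }
    }

    public static void main(String[] args)
    {
        new QueuedRequest(() -> System.out.println("served")).process();
    }
}
{code}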



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

2024-05-07 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844258#comment-17844258
 ] 

Alex Petrov commented on CASSANDRA-19215:
-

This is now largely superseded by work on [CASSANDRA-19534], as I have posted 
the patch there.

> "Query start time" in native transport request threads should be the task 
> enqueue time
> --
>
> Key: CASSANDRA-19215
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
> Project: Cassandra
>  Issue Type: Bug
>  Components: Messaging/Client
>Reporter: Runtian Liu
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: ci_summary.html, result_details.tar.gz
>
>
> Recently, our Cassandra 4.0.6 cluster experienced an outage due to a surge in 
> expensive traffic from the application side. This surge involved a large 
> volume of costly read queries, which took a considerable amount of time to 
> process on the server side. The client had timeout settings; if a request 
> timed out, it might trigger the sending of new requests. Since the server 
> nodes were overloaded, numerous nodes had hundreds of thousands of tasks 
> queued in the Native-Transport-Request pending queue. I expected that once 
> the application ceased sending requests, the server node would quickly return 
> to normal, as most requests in the queue were over half an hour old and 
> should have timed out rapidly, clearing the queue. However, it actually took 
> an hour to clear the native transport's pending queue, even with native 
> transport disabled. Upon examining the code, I noticed that for read/write 
> requests, the 
> [queryStartNanoTime|https://github.com/apache/cassandra/blob/cassandra-4.0/src/java/org/apache/cassandra/transport/Dispatcher.java#L78],
>  which determines if a request has timed out, only begins when the task 
> starts processing. This means that no matter how long a request has been 
> pending, it doesn't contribute to the timeout. I believe this is incorrect. 
> The timer should start when the Cassandra server receives the request or when 
> it enqueues the task, not when the request/task begins processing. This way, 
> an overloaded node with many pending tasks can quickly discard timed-out 
> requests and recover from an outage once new requests stop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-13138) SASI tries to fetch an extra page when resultset size is same size as page size

2024-05-07 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov reassigned CASSANDRA-13138:
---

Assignee: (was: Alex Petrov)

> SASI tries to fetch an extra page when resultset size is same size as page 
> size
> ---
>
> Key: CASSANDRA-13138
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13138
> Project: Cassandra
>  Issue Type: Bug
>  Components: Feature/SASI
>Reporter: Alex Petrov
>Priority: Normal
>
> For example, in a dataset that would return 10 rows, SASI would try (and 
> return an empty page) to fetch the next page, while filtering and 2i will 
> return results correctly:
> {code}
>  pk | ck1 | ck2 | reg1 | reg2 | reg3
> +-+-+--+--+--
>   6 |   5 |   5 |5 |5 |   10
>   7 |   5 |   5 |5 |5 |   10
>   9 |   5 |   5 |5 |5 |   10
>   4 |   5 |   5 |5 |5 |   10
>   3 |   5 |   5 |5 |5 |   10
>   5 |   5 |   5 |5 |5 |   10
>   0 |   5 |   5 |5 |5 |   10
>   8 |   5 |   5 |5 |5 |   10
>   2 |   5 |   5 |5 |5 |   10
>   1 |   5 |   5 |5 |5 |   10
> ---MORE---
> (10 rows)
> {code}
> (that {{--MORE--}} shouldn't have been there) 
> This might be an inherent limitation, although even if it is, we could opt to 
> fetch limit+1 when the data limits aren't exhausted. It seems there should be 
> a solution for it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-15413) Missing results on reading large frozen text map

2024-05-07 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov reassigned CASSANDRA-15413:
---

Assignee: (was: Alex Petrov)

> Missing results on reading large frozen text map
> 
>
> Key: CASSANDRA-15413
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15413
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/SSTable
>Reporter: Tyler Codispoti
>Priority: Normal
>
> Cassandra version: 2.2.15
> I have been running into a case where, when fetching the results from a table 
> with a frozen<map<text, text>>, if the number of results is greater than the 
> fetch size (default 5000), we can end up with missing data.
> Side note: The table schema comes from using KairosDB, but we've isolated 
> this issue to Cassandra itself. But it looks like this can cause problems for 
> users of KairosDB as well.
> Repro case. Tested against fresh install of Cassandra 2.2.15.
> 1. Create table (cqlsh)
> {code:sql}
> CREATE KEYSPACE test
>   WITH REPLICATION = { 
>'class' : 'SimpleStrategy', 
>'replication_factor' : 1 
>   };
>   CREATE TABLE test.test (
> name text,
> tags frozen<map<text, text>>,
> PRIMARY KEY (name, tags)
>   ) WITH CLUSTERING ORDER BY (tags ASC);
> {code}
> 2. Insert data (python3)
> {code:python}
> import time
> from cassandra.cluster import Cluster
> cluster = Cluster(['127.0.0.1'])
> session = cluster.connect('test')
> for i in range(0, 2):
> session.execute(
> """
> INSERT INTO test (name, tags)  
> VALUES (%s, %s)
> """,
> ("test_name", {'id':str(i)})
> )
> {code}
>  
> 3. Flush
>  
> {code:java}
> nodetool flush{code}
>  
>  
> 4. Fetch data (python3)
> {code:python}
> import time
> from cassandra.cluster import Cluster
> cluster = Cluster(['127.0.0.1'], control_connection_timeout=5000)
> session = cluster.connect('test')
> session.default_fetch_size = 5000
> session.default_timeout = 120
> count = 0
> rows = session.execute("select tags from test where name='test_name'")
> for row in rows:
> count += 1
> print(count)
> {code}
> Result: 10111 (expected 2)
>  
> Changing the page size changes the result count. Some quick samples:
>  
> ||default_fetch_size||count||
> |5000|10111|
> |1000|1830|
> |999|1840|
> |998|1850|
> |2|2|
> |10|2|
>  
>  
> In short, I cannot guarantee I'll get all the results back unless the page 
> size > number of rows.
> This seems to get worse with multiple SSTables (eg nodetool flush between 
> some of the insert batches). When using replication, the issue can get 
> disgustingly bad - potentially giving a different result on each query.
> Interestingly, if we pad the values on the tag map ("id" in this repro case) so 
> that the insertion is in lexicographical order, there is no issue. I believe 
> the issue also does not repro if I do not call "nodetool flush" before 
> querying.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-13478) SASI Sparse mode overflow corrupts the SSTable

2024-05-07 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov reassigned CASSANDRA-13478:
---

Assignee: (was: Alex Petrov)

> SASI Sparse mode overflow corrupts the SSTable
> --
>
> Key: CASSANDRA-13478
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13478
> Project: Cassandra
>  Issue Type: Bug
>  Components: Feature/SASI
> Environment: cqlsh 5.0.1 | Cassandra 3.10 | CQL spec 3.4.4 | Native 
> protocol v4 | ubuntu 14.04
>Reporter: jack chen
>Priority: Low
> Attachments: schema
>
>
> I have a table; the schema can be seen in the attached file.
> I would like to search the data using the timestamp data type with lt/gt/eq 
> as a query condition.
> Ex:
> {code}
> CREATE TABLE XXX.userlist (
> userid text PRIMARY KEY,
> lastposttime timestamp
> )
> Select * from userlist where lastposttime> '2017-04-01 16:00:00+';
> {code}
> There are 2 cases:
> If I insert the data and then select it, the result will be correct.
> But if I insert data, restart Cassandra the next day, and 
> after that select the data, there will be no data selected.
> The difference is that there is no service restart on the next day in the 
> first case. The data are actually still living in Cassandra, but the timestamp 
> can’t be used as the query condition.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-13243) testall failure in org.apache.cassandra.index.sasi.SASIIndexTest.testMultiExpressionQueriesWhereRowSplitBetweenSSTables-compression

2024-05-07 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov reassigned CASSANDRA-13243:
---

Assignee: (was: Alex Petrov)

> testall failure in 
> org.apache.cassandra.index.sasi.SASIIndexTest.testMultiExpressionQueriesWhereRowSplitBetweenSSTables-compression
> ---
>
> Key: CASSANDRA-13243
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13243
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Testing
>Reporter: Sean McCarthy
>Priority: Normal
>  Labels: test-failure, testall
> Attachments: TEST-org.apache.cassandra.index.sasi.SASIIndexTest.log
>
>
> example failure:
> http://cassci.datastax.com/job/trunk_testall/1412/testReport/org.apache.cassandra.index.sasi/SASIIndexTest/testMultiExpressionQueriesWhereRowSplitBetweenSSTables_compression
> {code}
> Error Message
> [key0, key11, key12, key13, key14, key6, key7, key8] expected:<10> but was:<8>
> {code}{code}
> Stacktrace
> junit.framework.AssertionFailedError: [key0, key11, key12, key13, key14, 
> key6, key7, key8] expected:<10> but was:<8>
>   at 
> org.apache.cassandra.index.sasi.SASIIndexTest.testMultiExpressionQueriesWhereRowSplitBetweenSSTables(SASIIndexTest.java:567)
>   at 
> org.apache.cassandra.index.sasi.SASIIndexTest.testMultiExpressionQueriesWhereRowSplitBetweenSSTables(SASIIndexTest.java:452)
> {code}{code}
> Standard Output
> ERROR [main] 2017-02-17 23:02:40,404 ?:? - SLF4J: stderr
> INFO  [main] 2017-02-17 23:02:40,830 ?:? - Configuration location: 
> file:/home/automaton/cassandra/test/conf/cassandra-murmur.yaml
> DEBUG [main] 2017-02-17 23:02:40,831 ?:? - Loading settings from 
> file:/home/automaton/cassandra/test/conf/cassandra-murmur.yaml
> INFO  [main] 2017-02-17 23:02:41,678 ?:? - Node 
> configuration:[allocate_tokens_for_keyspace=null; authenticator=null; 
> authorizer=null; auto_bootstrap=true; auto_snapshot=true; back_pres
> ...[truncated 416882 chars]...
> .957KiB), biggest 4.957KiB, smallest 4.957KiB
> DEBUG [CompactionExecutor:3] 2017-02-17 23:03:16,787 ?:? - Compacted 
> (cb40-f565-11e6-8e91-7511b7f59d65) 4 sstables to 
> [/home/automaton/cassandra/build/test/cassandra/data:231/system/local-7ad54392bcdd35a684174e047860b377/md-85-big,]
>  to level=0.  0.466KiB to 0.258KiB (~55% of original) in 58ms.  Read 
> Throughput = 7.914KiB/s, Write Throughput = 4.380KiB/s, Row Throughput = 
> ~2/s.  4 total partitions merged to 1.  Partition merge counts were {4:1, }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-05-03 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843389#comment-17843389
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

These tests look really good! I hadn't expected the single patched node scenario 
to work that well, but I'm glad that it helps even in that case.

Thank you for checking!

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 4.1.x, 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html, 
> image-2024-05-03-16-08-10-101.png, screenshot-1.png, screenshot-2.png, 
> screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png, 
> screenshot-7.png, screenshot-8.png, screenshot-9.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-05-03 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843375#comment-17843375
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

Thank you for looking into this. Just to make sure: the patch works on both the 
coordinator and replica sides, so it would make the most sense to compare two 
clusters, one with the patch and one without.

There might be some improvement if only one node uses deadlines, since all 
three nodes will benefit from replica-side shedding, but coordinator-side 
shedding will work for just that one node. I think having the patch on all 
nodes will have a more pronounced effect.

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 4.1.x, 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html, 
> screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-29 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842171#comment-17842171
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

This is great, thank you for testing!

My 100s timeout was erring (probably too far) on the side of sticking to the 
old behaviour. I was slightly concerned that people would see timeouts and 
conclude this is not something they want. But unfortunately there is no way for 
us to produce a reasonable workload balance without shedding some load and 
timing out lagging requests. I will update the default to 12s.

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-29 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842112#comment-17842112
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

[~brandon.williams] [~rustyrazorblade] would you be so kind as to try running 
your tests? I suggest setting {{native_transport_timeout_in_ms}} to about 10 
seconds (12 at most) and {{internode_timeout}} to {{true}} for starters. If you 
really want to push the limits, I'd suggest setting {{cql_start_time}} to 
{{REQUEST}}, but this is optional, as we will not roll it out with this setting 
enabled.
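
For reference, a minimal cassandra.yaml sketch of the suggested settings 
(option names are taken from this comment; exact keys and value formats may 
vary between branches, so verify against your build):

{code}
# Sketch only: settings as named in this comment, assuming the native transport
# timeout is expressed in milliseconds; verify names/units against your branch.
native_transport_timeout_in_ms: 12000   # roughly 10-12 seconds
internode_timeout: true
# Optional, only to push the limits further; not planned to be enabled by default:
# cql_start_time: REQUEST
{code}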

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-29 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842112#comment-17842112
 ] 

Alex Petrov edited comment on CASSANDRA-19534 at 4/29/24 5:24 PM:
--

[~brandon.williams] [~rustyrazorblade] would you be so kind as to try running 
your tests against the branch posted above? I suggest setting 
{{native_transport_timeout_in_ms}} to about 10 seconds (12 at most) and 
{{internode_timeout}} to {{true}} for starters. If you really want to push the 
limits, I'd suggest setting {{cql_start_time}} to {{REQUEST}}, but this is 
optional, as we will not roll it out with this setting enabled.


was (Author: ifesdjeen):
[~brandon.williams] [~rustyrazorblade] would you be so kind as to try running 
your tests? I suggest setting {{native_transport_timeout_in_ms}} to about 10 
seconds (12 at most) and {{internode_timeout}} to {{true}} for starters. If you 
really want to push the limits, I'd suggest setting {{cql_start_time}} to 
{{REQUEST}}, but this is optional, as we will not roll it out with this setting 
enabled.

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-29 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19534:

Test and Documentation Plan: Includes tests, also was tested separately; 
screenshots and description attached
 Status: Patch Available  (was: Open)

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-29 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19534:

Attachment: ci_summary.html

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce

2024-04-29 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19158:

Attachment: ci_summary.html

> Reuse native transport-driven futures in Debounce
> -
>
> Key: CASSANDRA-19158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19158
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary.html
>
>
> Currently, we create a future in Debounce, then create one more future in 
> RemoteProcessor#sendWithCallback. This is further exacerbated by chaining 
> calls, when we first attempt to catch up from a peer, and then from the CMS.
> First of all, we should only use the native-transport-timeout-driven futures 
> returned from sendWithCallback, since they implement reasonable retries under 
> the hood and are easy to bulk-configure (i.e. you can simply change the 
> timeout in the yaml and have all futures change their behaviour).
> Second, we should _chain_ futures and use map or andThen for fallback 
> operations, such as trying to catch up from the CMS after an unsuccessful 
> attempt to catch up from a peer.
> This should significantly simplify the code and reduce the number of 
> blocked/waiting threads.
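
A minimal, self-contained sketch of the chaining idea described above, written 
against plain JDK futures rather than Cassandra's internal Future type; the 
class and method names below are illustrative only, not the actual code:

{code}
import java.util.concurrent.CompletableFuture;

// Illustrative only: "catch up from a peer, then fall back to the CMS" expressed
// as one chained future instead of separate futures that block waiting threads.
public class CatchupChainingSketch
{
    // Stand-in for a native-transport-driven future, e.g. the kind of future
    // RemoteProcessor#sendWithCallback returns (hypothetical simplification).
    static CompletableFuture<String> catchUpFromPeer()
    {
        return CompletableFuture.failedFuture(new RuntimeException("peer did not respond"));
    }

    static CompletableFuture<String> catchUpFromCms()
    {
        return CompletableFuture.completedFuture("replayed log from CMS");
    }

    public static void main(String[] args)
    {
        // The fallback is chained onto the first attempt; no thread sits blocked
        // on an intermediate future while waiting for the peer.
        CompletableFuture<String> catchup =
                catchUpFromPeer().exceptionallyCompose(t -> catchUpFromCms());

        System.out.println(catchup.join()); // prints "replayed log from CMS"
    }
}
{code}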



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19592) Expand CREATE TABLE CQL on a coordinating node before submitting to CMS

2024-04-26 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841204#comment-17841204
 ] 

Alex Petrov commented on CASSANDRA-19592:
-

The compact storage-related test is now fixed in the pushed version. 

> Expand CREATE TABLE CQL on a coordinating node before submitting to CMS
> ---
>
> Key: CASSANDRA-19592
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19592
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Schema
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary.html
>
>
> This is done to unblock CASSANDRA-12937 and allow preserving defaults with 
> which the table was created between node bounces and between nodes with 
> different configurations. For now, we are preserving 5.0 behaviour.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19592) Expand CREATE TABLE CQL on a coordinating node before submitting to CMS

2024-04-26 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19592:

Attachment: ci_summary.html

> Expand CREATE TABLE CQL on a coordinating node before submitting to CMS
> ---
>
> Key: CASSANDRA-19592
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19592
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Schema
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary.html
>
>
> This is done to unblock CASSANDRA-12937 and allow preserving defaults with 
> which the table was created between node bounces and between nodes with 
> different configurations. For now, we are preserving 5.0 behaviour.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19221) CMS: Nodes can restart with new ipaddress already defined in the cluster

2024-04-26 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19221:

  Since Version: 5.0-alpha1
Source Control Link: 
https://github.com/apache/cassandra/commit/38512a469cef06770384423d0b30e3e85b511258
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

> CMS: Nodes can restart with new ipaddress already defined in the cluster
> 
>
> Key: CASSANDRA-19221
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19221
> Project: Cassandra
>  Issue Type: Bug
>  Components: Transactional Cluster Metadata
>Reporter: Paul Chandler
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-1.html, ci_summary.html
>
>
> I am simulating running a cluster in Kubernetes and testing what happens when 
> several pods go down and  ip addresses are swapped between nodes. In 4.0 this 
> is blocked and the node cannot be restarted.
> To simulate this I create a 3 node cluster on a local machine using 3 
> loopback addresses
> {code}
> 127.0.0.1
> 127.0.0.2
> 127.0.0.3
> {code}
> The nodes are created correctly and the first node is assigned as a CMS node 
> as shown:
> {code}
> bin/nodetool -p 7199 describecms
> {code}
> Cluster Metadata Service:
> {code}
> Members: /127.0.0.1:7000
> Is Member: true
> Service State: LOCAL
> {code}
> At this point I bring down the nodes 127.0.0.2 and 127.0.0.3 and swap the ip 
> addresses for the rpc_address and listen_address 
>  
> The nodes come back as normal, but the nodeid has now been swapped against 
> the ip address:
> Before:
> {code}
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load       Tokens  Owns (effective)  Host ID                   
>             Rack
> UN  127.0.0.3  75.2 KiB   16      76.0%             
> 6d194555-f6eb-41d0-c000-0003  rack1
> UN  127.0.0.2  86.77 KiB  16      59.3%             
> 6d194555-f6eb-41d0-c000-0002  rack1
> UN  127.0.0.1  80.88 KiB  16      64.7%             
> 6d194555-f6eb-41d0-c000-0001  rack1
> {code}
> After:
> {code}
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                  
>              Rack
> UN  127.0.0.3  149.62 KiB  16      76.0%             
> 6d194555-f6eb-41d0-c000-0003  rack1
> UN  127.0.0.2  155.48 KiB  16      59.3%             
> 6d194555-f6eb-41d0-c000-0002  rack1
> UN  127.0.0.1  75.74 KiB   16      64.7%             
> 6d194555-f6eb-41d0-c000-0001  rack1
> {code}
> On previous tests of this I have created a table with a replication factor of 
> 1, inserted some data before the swap.   After the swap the data on nodes 2 
> and 3 is now missing. 
> One theory I have is that I am using different port numbers for the different 
> nodes, and I am only swapping the ip addresses and not the port numbers, so 
> the ip:port still looks unique
> i.e. 127.0.0.2:9043 becomes 127.0.0.2:9044
> and 127.0.0.3:9044 becomes 127.0.0.3:9043
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19592) Expand CREATE TABLE CQL on a coordinating node before submitting to CMS

2024-04-26 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19592:

Test and Documentation Plan: Tests included
 Status: Patch Available  (was: Open)

> Expand CREATE TABLE CQL on a coordinating node before submitting to CMS
> ---
>
> Key: CASSANDRA-19592
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19592
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Schema
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
>
> This is done to unblock CASSANDRA-12937 and allow preserving defaults with 
> which the table was created between node bounces and between nodes with 
> different configurations. For now, we are preserving 5.0 behaviour.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19592) Expand CREATE TABLE CQL on a coordinating node before submitting to CMS

2024-04-26 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19592:

 Bug Category: Parent values: Correctness(12982)Level 1 values: API / 
Semantic Implementation(12988)
   Complexity: Normal
  Component/s: Cluster/Schema
Discovered By: Code Inspection
 Severity: Normal
   Status: Open  (was: Triage Needed)

> Expand CREATE TABLE CQL on a coordinating node before submitting to CMS
> ---
>
> Key: CASSANDRA-19592
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19592
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Schema
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
>
> This is done to unblock CASSANDRA-12937 and allow preserving defaults with 
> which the table was created between node bounces and between nodes with 
> different configurations. For now, we are preserving 5.0 behaviour.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-19592) Expand CREATE TABLE CQL on a coordinating node before submitting to CMS

2024-04-26 Thread Alex Petrov (Jira)
Alex Petrov created CASSANDRA-19592:
---

 Summary: Expand CREATE TABLE CQL on a coordinating node before 
submitting to CMS
 Key: CASSANDRA-19592
 URL: https://issues.apache.org/jira/browse/CASSANDRA-19592
 Project: Cassandra
  Issue Type: Bug
Reporter: Alex Petrov
Assignee: Alex Petrov


This is done to unblock CASSANDRA-12937 and allow preserving defaults with 
which the table was created between node bounces and between nodes with 
different configurations. For now, we are preserving 5.0 behaviour.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-12937) Default setting (yaml) for SSTable compression

2024-04-26 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841183#comment-17841183
 ] 

Alex Petrov commented on CASSANDRA-12937:
-

Yes, tried this locally, wrote a bunch of tests, patch coming up as soon as 
python dtests wrap up!

And yes, it seemed like we should just replicate what 5.0 does right now, and 
5.0 does this implicitly via schema mutations, which are created using the 
coordinating node's defaults. Since we are not using schema mutations in 5.1 
anymore, I thought expanding the CQL was the next best option.

> Default setting (yaml) for SSTable compression
> --
>
> Key: CASSANDRA-12937
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12937
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Config
>Reporter: Michael Semb Wever
>Assignee: Stefan Miklosovic
>Priority: Low
>  Labels: AdventCalendar2021
> Fix For: 5.x
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> In many situations the choice of compression for sstables is more relevant to 
> the disks attached than to the schema and data.
> This issue is to add to cassandra.yaml a default value for sstable 
> compression that new tables will inherit (instead of the defaults found in 
> {{CompressionParams.DEFAULT}}.
> Examples where this can be relevant are filesystems that do on-the-fly 
> compression (btrfs, zfs) or specific disk configurations or even specific C* 
> versions (see CASSANDRA-10995 ).
> +Additional information for newcomers+
> Some new fields need to be added to {{cassandra.yaml}} to allow specifying 
> the field required for defining the default compression parameters. In 
> {{DatabaseDescriptor}} a new {{CompressionParams}} field should be added for 
> the default compression. This field should be initialized in 
> {{DatabaseDescriptor.applySimpleConfig()}}. At the different places where 
> {{CompressionParams.DEFAULT}} was used the code should call 
> {{DatabaseDescriptor#getDefaultCompressionParams}} that should return some 
> copy of configured {{CompressionParams}}.
> Some unit test using {{OverrideConfigurationLoader}} should be used to test 
> that the table schema use the new default when a new table is created (see 
> CreateTest for some example).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-12937) Default setting (yaml) for SSTable compression

2024-04-26 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841157#comment-17841157
 ] 

Alex Petrov commented on CASSANDRA-12937:
-

It looks like it is possible to solve this problem, for now, in a much simpler 
way: we can simply fully expand the {{CREATE TABLE}} statement on the 
coordinator and thereby persist its arguments. I think we will need a CEP for a 
more sophisticated approach, which we should probably leave for later.
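
As a hypothetical illustration of what such expansion could look like (the 
option names and default values below are examples only, not the exact output):

{code}
-- What the user submits:
CREATE TABLE ks.t (k int PRIMARY KEY, v text);

-- What the coordinator could submit to the CMS after expansion, with its local
-- defaults (compression shown here as the example) spelled out explicitly:
CREATE TABLE ks.t (k int PRIMARY KEY, v text)
    WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': '16'};
{code}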

> Default setting (yaml) for SSTable compression
> --
>
> Key: CASSANDRA-12937
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12937
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Config
>Reporter: Michael Semb Wever
>Assignee: Stefan Miklosovic
>Priority: Low
>  Labels: AdventCalendar2021
> Fix For: 5.x
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> In many situations the choice of compression for sstables is more relevant to 
> the disks attached than to the schema and data.
> This issue is to add to cassandra.yaml a default value for sstable 
> compression that new tables will inherit (instead of the defaults found in 
> {{CompressionParams.DEFAULT}}.
> Examples where this can be relevant are filesystems that do on-the-fly 
> compression (btrfs, zfs) or specific disk configurations or even specific C* 
> versions (see CASSANDRA-10995 ).
> +Additional information for newcomers+
> Some new fields need to be added to {{cassandra.yaml}} to allow specifying 
> the field required for defining the default compression parameters. In 
> {{DatabaseDescriptor}} a new {{CompressionParams}} field should be added for 
> the default compression. This field should be initialized in 
> {{DatabaseDescriptor.applySimpleConfig()}}. At the different places where 
> {{CompressionParams.DEFAULT}} was used the code should call 
> {{DatabaseDescriptor#getDefaultCompressionParams}} that should return some 
> copy of configured {{CompressionParams}}.
> Some unit test using {{OverrideConfigurationLoader}} should be used to test 
> that the table schema use the new default when a new table is created (see 
> CreateTest for some example).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-24 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840173#comment-17840173
 ] 

Alex Petrov edited comment on CASSANDRA-19534 at 4/24/24 7:17 AM:
--

Sorry for the lack of clarity; before this patch, there was no deadline at all. 
Tasks would live in the system essentially forever, clogging queues and doing 
busy work. I was intending to post a patch, but it is currently in my CI queue; 
it is otherwise ready to go.

I believe that with a 12-second default, users will only see an improvement and 
there will be no learning curve at all. All the configuration options are for 
people who understand their request lifetimes and want to get an even better 
profile.


was (Author: ifesdjeen):
Sorry for the lack of clarity; today there is no deadline at all. Tasks will 
live in the system essentially forever, clogging queues and doing busy work. I 
was intending to post a patch, but it is currently in my CI queue; it is 
otherwise ready to go.

I believe that with a 12-second default, users will only see an improvement and 
there will be no learning curve at all. All the configurable options are for 
people who understand their request lifetimes and want to get an even better 
profile.

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19221) CMS: Nodes can restart with new ipaddress already defined in the cluster

2024-04-23 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840204#comment-17840204
 ] 

Alex Petrov commented on CASSANDRA-19221:
-

Addressed your comments, [~samt]; both failures are timeouts that are unrelated 
to the patch. I believe we should split {{MetadataChangeSimulationTest}}, since 
after adding the transient tests it sometimes seems to cross the timeout 
deadline.

> CMS: Nodes can restart with new ipaddress already defined in the cluster
> 
>
> Key: CASSANDRA-19221
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19221
> Project: Cassandra
>  Issue Type: Bug
>  Components: Transactional Cluster Metadata
>Reporter: Paul Chandler
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-1.html, ci_summary.html
>
>
> I am simulating running a cluster in Kubernetes and testing what happens when 
> several pods go down and  ip addresses are swapped between nodes. In 4.0 this 
> is blocked and the node cannot be restarted.
> To simulate this I create a 3 node cluster on a local machine using 3 
> loopback addresses
> {code}
> 127.0.0.1
> 127.0.0.2
> 127.0.0.3
> {code}
> The nodes are created correctly and the first node is assigned as a CMS node 
> as shown:
> {code}
> bin/nodetool -p 7199 describecms
> {code}
> Cluster Metadata Service:
> {code}
> Members: /127.0.0.1:7000
> Is Member: true
> Service State: LOCAL
> {code}
> At this point I bring down the nodes 127.0.0.2 and 127.0.0.3 and swap the ip 
> addresses for the rpc_address and listen_address 
>  
> The nodes come back as normal, but the nodeid has now been swapped against 
> the ip address:
> Before:
> {code}
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load       Tokens  Owns (effective)  Host ID                   
>             Rack
> UN  127.0.0.3  75.2 KiB   16      76.0%             
> 6d194555-f6eb-41d0-c000-0003  rack1
> UN  127.0.0.2  86.77 KiB  16      59.3%             
> 6d194555-f6eb-41d0-c000-0002  rack1
> UN  127.0.0.1  80.88 KiB  16      64.7%             
> 6d194555-f6eb-41d0-c000-0001  rack1
> {code}
> After:
> {code}
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                  
>              Rack
> UN  127.0.0.3  149.62 KiB  16      76.0%             
> 6d194555-f6eb-41d0-c000-0003  rack1
> UN  127.0.0.2  155.48 KiB  16      59.3%             
> 6d194555-f6eb-41d0-c000-0002  rack1
> UN  127.0.0.1  75.74 KiB   16      64.7%             
> 6d194555-f6eb-41d0-c000-0001  rack1
> {code}
> On previous tests of this I have created a table with a replication factor of 
> 1, inserted some data before the swap.   After the swap the data on nodes 2 
> and 3 is now missing. 
> One theory I have is that I am using different port numbers for the different 
> nodes, and I am only swapping the ip addresses and not the port numbers, so 
> the ip:port still looks unique
> i.e. 127.0.0.2:9043 becomes 127.0.0.2:9044
> and 127.0.0.3:9044 becomes 127.0.0.3:9043
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19221) CMS: Nodes can restart with new ipaddress already defined in the cluster

2024-04-23 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19221:

Attachment: ci_summary-1.html

> CMS: Nodes can restart with new ipaddress already defined in the cluster
> 
>
> Key: CASSANDRA-19221
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19221
> Project: Cassandra
>  Issue Type: Bug
>  Components: Transactional Cluster Metadata
>Reporter: Paul Chandler
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-1.html, ci_summary.html
>
>
> I am simulating running a cluster in Kubernetes and testing what happens when 
> several pods go down and  ip addresses are swapped between nodes. In 4.0 this 
> is blocked and the node cannot be restarted.
> To simulate this I create a 3 node cluster on a local machine using 3 
> loopback addresses
> {code}
> 127.0.0.1
> 127.0.0.2
> 127.0.0.3
> {code}
> The nodes are created correctly and the first node is assigned as a CMS node 
> as shown:
> {code}
> bin/nodetool -p 7199 describecms
> {code}
> Cluster Metadata Service:
> {code}
> Members: /127.0.0.1:7000
> Is Member: true
> Service State: LOCAL
> {code}
> At this point I bring down the nodes 127.0.0.2 and 127.0.0.3 and swap the ip 
> addresses for the rpc_address and listen_address 
>  
> The nodes come back as normal, but the nodeid has now been swapped against 
> the ip address:
> Before:
> {code}
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load       Tokens  Owns (effective)  Host ID                   
>             Rack
> UN  127.0.0.3  75.2 KiB   16      76.0%             
> 6d194555-f6eb-41d0-c000-0003  rack1
> UN  127.0.0.2  86.77 KiB  16      59.3%             
> 6d194555-f6eb-41d0-c000-0002  rack1
> UN  127.0.0.1  80.88 KiB  16      64.7%             
> 6d194555-f6eb-41d0-c000-0001  rack1
> {code}
> After:
> {code}
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                  
>              Rack
> UN  127.0.0.3  149.62 KiB  16      76.0%             
> 6d194555-f6eb-41d0-c000-0003  rack1
> UN  127.0.0.2  155.48 KiB  16      59.3%             
> 6d194555-f6eb-41d0-c000-0002  rack1
> UN  127.0.0.1  75.74 KiB   16      64.7%             
> 6d194555-f6eb-41d0-c000-0001  rack1
> {code}
> On previous tests of this I have created a table with a replication factor of 
> 1, inserted some data before the swap.   After the swap the data on nodes 2 
> and 3 is now missing. 
> One theory I have is that I am using different port numbers for the different 
> nodes, and I am only swapping the ip addresses and not the port numbers, so 
> the ip:port still looks unique
> i.e. 127.0.0.2:9043 becomes 127.0.0.2:9044
> and 127.0.0.3:9044 becomes 127.0.0.3:9043
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-23 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840173#comment-17840173
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

Sorry for the lack of clarity; today there is no deadline at all. Tasks will 
live in the system essentially forever, clogging queues and doing busy work. I 
was intending to post a patch, but it is currently in my CI queue; it is 
otherwise ready to go.

I believe that with a 12-second default, users will only see an improvement and 
there will be no learning curve at all. All the configurable options are for 
people who understand their request lifetimes and want to get an even better 
profile.

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-23 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19534:

Attachment: Scenario 2 - QUEUE + Backpressure.jpg
Scenario 2 - QUEUE.jpg
Scenario 2 - Stock.jpg

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, 
> Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-23 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19534:

Attachment: Scenario 1 - QUEUE.jpg
Scenario 1 - QUEUE + Backpressure.jpg
Scenario 1 - Stock.jpg

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
> Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - 
> QUEUE.jpg, Scenario 1 - Stock.jpg
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-23 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840058#comment-17840058
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

The main change is the introduction of a (currently implicit) configurable 
{_}native request deadline{_}. No request, read or write, will be allowed to 
prolong its execution beyond this deadline. Some of the hidden places that 
allowed requests to stay overdue were local executor runnables, replica-side 
writes, and hints. The default is 12 seconds, since this is how long the 3.x 
driver (which I believe is still the most used version in the community) waits 
before removing its handlers, after which any response from the server is 
simply ignored. There is now an _option_ to enable expiration based on queue 
time, which will be _disabled_ by default to preserve existing semantics, but 
my tests have shown that enabling it only has positive effects. We will try it 
out cautiously in different clusters over the next months and see whether the 
tests match up with real loads before we change any of the defaults.

So by default the behaviour will be as follows:
 # If a request has spent more than 12 seconds in the NATIVE queue, we throw an 
Overloaded exception back to the client. This timeout used to be the max of the 
read/write/range/counter rpc timeouts.
 # If a request has spent less than 12 seconds, it is allowed to execute; any 
request issued by the coordinator can live:
 ## _either_ for {{Verb.timeout}} milliseconds,
 ## _or_ up to the native request deadline, measured from the time the request 
was admitted to the coordinator's NATIVE queue, whichever happens earlier.

Example 1, read timeout is 5 seconds:
 # Client sends a request; request spends 6 seconds in the NATIVE queue
 # Coordinator issues requests to replicas; two replicas respond within 3 
seconds
 # Coordinator responds to the client with success

Example 2, read timeout is 5 seconds:
 # Client sends a request; request spends 6 seconds in the NATIVE queue
 # Coordinator issues requests to replicas; one replica responds within 3 
seconds; the other replicas fail to respond within the 5-second read timeout
 # Coordinator responds to the client with read timeout (preserves current 
behaviour)

Example 3, read timeout is 5 seconds:
 # Client sends a request; request spends 10 seconds in the NATIVE queue
 # Coordinator issues requests to replicas; all replicas fail to respond within 
the remaining 2 seconds (the request has already spent 10 of its 12 seconds 
queued)
 # Coordinator responds to the client with read timeout; if messages are still 
in queue on replicas, they will get dropped before processing
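
To make the examples above concrete, here is a minimal, self-contained sketch 
of how the effective deadline can be derived; the names are hypothetical and 
this is not the actual patch:

{code}
import java.util.concurrent.TimeUnit;

// Illustration only: hypothetical names, not the actual patch.
public class DeadlineSketch
{
    public static void main(String[] args)
    {
        long nativeRequestDeadline = TimeUnit.SECONDS.toNanos(12); // native request deadline
        long verbTimeout = TimeUnit.SECONDS.toNanos(5);            // e.g. read rpc timeout

        long enqueuedAt = 0;                                       // admitted to the NATIVE queue
        long processingStart = TimeUnit.SECONDS.toNanos(6);        // spent 6s queued, as in Example 1

        // A coordinator-issued request may live until the earlier of the two deadlines.
        long deadline = Math.min(enqueuedAt + nativeRequestDeadline,
                                 processingStart + verbTimeout);

        System.out.println("effective deadline: " + TimeUnit.NANOSECONDS.toSeconds(deadline) + "s after enqueue");
    }
}
{code}

With Example 3's numbers (10 seconds queued), the same computation gives 
min(12, 10 + 5) = 12 seconds, i.e. only 2 seconds of replica budget remain once 
processing starts.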

There will be a _new_ metric that shows how many of the timeouts would 
previously have been “blind timeouts”. That is, the client _would_ register 
them as timeouts, but we as server-side operators would be oblivious to them. 
This metric will keep us collectively motivated even if we see a slight uptick 
in timeouts after committing the patch.

Lastly, there is an option to limit how much of the 12 seconds client requests 
are allowed to spend in the native queue. For example, if a client request has 
spent 80% of its maximum 12 seconds in the native queue, we start applying 
backpressure to the client socket (or throwing an Overloaded exception, 
depending on the value of {{{}native_transport_throw_on_overload{}}}). We have 
to be careful with enabling this one, since my tests have shown that while we 
see fewer timeouts server side, clients see more timeouts, because part of the 
time they consider “request time” is now spent somewhere in TCP queues, which 
we cannot account for.
h3. New Configuration Params
h3. cql_start_time

Configures what is considered the base for the replica-side timeout. This 
option existed before; it is now actually safe to enable. It still defaults to 
{{REQUEST}} (processing start time is taken as the timeout base), and the 
alternative is {{QUEUE}} (queue admission time is taken as the timeout base). 
Unfortunately, there is no consistent view of the timeout base in the 
community: some people think that server-side read/write timeouts are how much 
time _replicas_ have to respond to the coordinator, while others believe they 
mean how much time the _coordinator_ has to respond to the client. This patch 
is agnostic to these beliefs. 
h3. native_transport_throw_on_overload

Whether we should apply backpressure to the client (i.e. stop reading from the 
socket) or throw an Overloaded exception. The default is socket backpressure, 
and this is probably fine for now. In principle, this can also be set by the 
client on a per-connection basis via protocol options. However, the 3.x series 
of the driver does not implement this addition, so in practice it is not really 
used. If used, the setting from the client takes precedence.
h3. native_transport_timeout_in_ms

The absolute maximum amount of time the server has to respond to 
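
Putting the options above together, a cassandra.yaml-style sketch; the concrete 
values here are only assumptions based on the defaults described in this 
comment, not authoritative:

{code}
# Sketch only; values are assumptions, not authoritative defaults.
cql_start_time: REQUEST                     # or QUEUE to measure replica-side timeouts from queue admission
native_transport_throw_on_overload: false   # false = apply socket backpressure instead of throwing Overloaded
native_transport_timeout_in_ms: 12000       # absolute upper bound on server-side request lifetime
{code}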

[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-19 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838960#comment-17838960
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

I guess this can explain it. We have 32 read threads, 32 write threads, and 128 
native threads, so a 2:1 relation. The read queue is slightly deeper (about 80 
requests), which makes sense since latency there is probably higher (though it 
depends on the request), and the write queue is almost empty. We can easily 
have all 128 requests blocked in this case, so they cannot really overload the 
downstream stages. Besides, there are no hints, so at least part of the issue 
we may have in a distributed environment is not applicable. 

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19344) Range movements involving transient replicas must safely enact changes to read and write replica sets

2024-04-19 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838910#comment-17838910
 ] 

Alex Petrov commented on CASSANDRA-19344:
-

+1

> Range movements involving transient replicas must safely enact changes to 
> read and write replica sets
> -
>
> Key: CASSANDRA-19344
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19344
> Project: Cassandra
>  Issue Type: Bug
>  Components: CI
>Reporter: Ekaterina Dimitrova
>Assignee: Sam Tunnicliffe
>Priority: Normal
> Fix For: 5.x
>
> Attachments: ci_summary-1.html, ci_summary.html, 
> remove-n4-post-19344.txt, remove-n4-pre-19344.txt, result_details.tar.gz
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> (edit) This was originally opened due to a flaky test 
> {{org.apache.cassandra.distributed.test.TransientRangeMovementTest.testRemoveNode-_jdk17}}
> The test can fail in two different ways:
> {code:java}
> junit.framework.AssertionFailedError: NOT IN CURRENT: 31 -- [(00,20), 
> (31,50)] at 
> org.apache.cassandra.distributed.test.TransientRangeMovementTest.assertAllContained(TransientRangeMovementTest.java:203)
>  at 
> org.apache.cassandra.distributed.test.TransientRangeMovementTest.testRemoveNode(TransientRangeMovementTest.java:183)
>  at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method) at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
>  at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43){code}
> as in here - 
> [https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2639/workflows/32b92ce7-5e9d-4efb-8362-d200d2414597/jobs/55139/tests#failed-test-0]
> and
> {code:java}
> junit.framework.AssertionFailedError: nodetool command [removenode, 
> 6d194555-f6eb-41d0-c000-0003, --force] was not successful stdout: 
> stderr: error: Node /127.0.0.4:7012 is alive and owns this ID. Use 
> decommission command to remove it from the ring -- StackTrace -- 
> java.lang.UnsupportedOperationException: Node /127.0.0.4:7012 is alive and 
> owns this ID. Use decommission command to remove it from the ring at 
> org.apache.cassandra.tcm.sequences.SingleNodeSequences.removeNode(SingleNodeSequences.java:110)
>  at 
> org.apache.cassandra.service.StorageService.removeNode(StorageService.java:3682)
>  at org.apache.cassandra.tools.NodeProbe.removeNode(NodeProbe.java:1020) at 
> org.apache.cassandra.tools.nodetool.RemoveNode.execute(RemoveNode.java:51) at 
> org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:388)
>  at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:373) at 
> org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:272) at 
> org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:1129)
>  at 
> org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$51(Instance.java:1038)
>  at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61) at 
> org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71) at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>  at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>  at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>  at java.base/java.lang.Thread.run(Thread.java:833) Notifications: Error: 
> java.lang.UnsupportedOperationException: Node /127.0.0.4:7012 is alive and 
> owns this ID. Use decommission command to remove it from the ring at 
> org.apache.cassandra.tcm.sequences.SingleNodeSequences.removeNode(SingleNodeSequences.java:110)
>  at 
> org.apache.cassandra.service.StorageService.removeNode(StorageService.java:3682)
>  at org.apache.cassandra.tools.NodeProbe.removeNode(NodeProbe.java:1020) at 
> org.apache.cassandra.tools.nodetool.RemoveNode.execute(RemoveNode.java:51) at 
> org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:388)
>  at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:373) at 
> org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:272) at 
> org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:1129)
>  at 
> org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$51(Instance.java:1038)
>  at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61) at 
> org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71) at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>  at 
> 

[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-19 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1783#comment-1783
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

Talked to [~brandon.williams] and checked the remains of the cluster in the bad 
state; the symptoms match my own observations and the issue I have seen: 180K+ 
tasks in the Native queue. I am a bit surprised that the read and write queues 
are almost empty (under 100 items in both), but depending on which node was 
coordinating, this can be OK.

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce

2024-04-18 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov reassigned CASSANDRA-19158:
---

Assignee: Alex Petrov

> Reuse native transport-driven futures in Debounce
> -
>
> Key: CASSANDRA-19158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19158
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
>
> Currently, we create a future in Debounce, then create one more future in 
> RemoteProcessor#sendWithCallback. This is further exacerbated by chaining 
> calls, when we first attempt to catch up from peer, and then from CMS.
> First of all, we should always only use a native transport timeout driven 
> futures returned from sendWithCallback, since they implement reasonable 
> retries under the hood, and are easy to bulk-configure (ie you can simply 
> change timeout in yaml and have all futures change their behaviour).
> Second, we should _chain_ futures and use map or andThen for fallback 
> operations such as trying to catch up from CMS after an unsuccesful attemp to 
> catch up from peer.
> This should significantly simplify the code and number of blocked/waiting 
> threads.
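
For illustration only, a minimal sketch of the chaining idea using JDK 
{{CompletableFuture}} as a stand-in for the transport-driven futures described 
above; the names here are hypothetical, not Cassandra's actual API:

{code}
import java.util.concurrent.CompletableFuture;

// Sketch only: CompletableFuture stands in for the transport-driven futures;
// names are hypothetical.
public class CatchUpSketch
{
    static CompletableFuture<String> catchUpFromPeer()
    {
        // Pretend the peer is unavailable so the fallback path is exercised.
        return CompletableFuture.failedFuture(new RuntimeException("peer unavailable"));
    }

    static CompletableFuture<String> catchUpFromCMS()
    {
        return CompletableFuture.completedFuture("caught up from CMS");
    }

    public static void main(String[] args)
    {
        // Chain the fallback instead of blocking a thread between the two attempts.
        CompletableFuture<String> catchUp =
            catchUpFromPeer().exceptionallyCompose(failure -> catchUpFromCMS());

        System.out.println(catchUp.join());
    }
}
{code}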



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19514) When jvm-dtest is shutting down an instance TCM retries block the shutdown causing the test to fail

2024-04-18 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838664#comment-17838664
 ] 

Alex Petrov commented on CASSANDRA-19514:
-

+1 on the latest trunk patch! Thank you!

> When jvm-dtest is shutting down an instance TCM retries block the shutdown 
> causing the test to fail
> ---
>
> Key: CASSANDRA-19514
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19514
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Membership, Test/dtest/java
>Reporter: David Capwell
>Assignee: Sam Tunnicliffe
>Priority: Normal
> Fix For: 5.1
>
> Attachments: ci_summary.html, result_details.tar.gz
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> org.apache.cassandra.distributed.test.log.RequestCurrentEpochTest#testRequestingPeerWatermarks
> {code}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException
>org.apache.cassandra.utils.Throwables.maybeFail(Throwables.java:79)
>
> org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:540)
>
> org.apache.cassandra.distributed.impl.AbstractCluster.close(AbstractCluster.java:1098)
>
> org.apache.cassandra.distributed.test.log.RequestCurrentEpochTest.testRequestingPeerWatermarks(RequestCurrentEpochTest.java:77)
>java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  Caused by: java.util.concurrent.TimeoutException
>
> org.apache.cassandra.utils.concurrent.AbstractFuture.get(AbstractFuture.java:253)
>
> org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:532) 
> Suppressed: java.util.concurrent.TimeoutException
> {code}
> In debugger I found the blocked future and it was 
> src/java/org/apache/cassandra/tcm/EpochAwareDebounce.java waiting on 
> src/java/org/apache/cassandra/tcm/RemoteProcessor.java retries



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-17 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838360#comment-17838360
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

Do you have observability data from the cluster, by any chance? Would you be 
able to check the pending request counts for the Native, Read, and Write stages?

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-17 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838355#comment-17838355
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

I am a bit surprised to see that on 4.1 we seem to stabilize when errors begin. 
In essence, the problem is that request lifetime is unbounded. There are 
several contributing factors, such as the lifetimes of local runnables, hints 
being re-submitted on the local mutation queue, and mutations on the replica 
side not respecting message expiration deadlines. I think most of these should 
have been present in 4.1, too. Unless, of course, there is more than one 
problem. I had initially discovered it pre-5.0, though. 

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19221) CMS: Nodes can restart with new ipaddress already defined in the cluster

2024-04-17 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19221:

Test and Documentation Plan: Includes a test 
 Status: Patch Available  (was: Open)

> CMS: Nodes can restart with new ipaddress already defined in the cluster
> 
>
> Key: CASSANDRA-19221
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19221
> Project: Cassandra
>  Issue Type: Bug
>  Components: Transactional Cluster Metadata
>Reporter: Paul Chandler
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary.html
>
>
> I am simulating running a cluster in Kubernetes and testing what happens when 
> several pods go down and  ip addresses are swapped between nodes. In 4.0 this 
> is blocked and the node cannot be restarted.
> To simulate this I create a 3 node cluster on a local machine using 3 
> loopback addresses
> {code}
> 127.0.0.1
> 127.0.0.2
> 127.0.0.3
> {code}
> The nodes are created correctly and the first node is assigned as a CMS node 
> as shown:
> {code}
> bin/nodetool -p 7199 describecms
> {code}
> Cluster Metadata Service:
> {code}
> Members: /127.0.0.1:7000
> Is Member: true
> Service State: LOCAL
> {code}
> At this point I bring down the nodes 127.0.0.2 and 127.0.0.3 and swap the ip 
> addresses for the rpc_address and listen_address 
>  
> The nodes come back as normal, but the nodeid has now been swapped against 
> the ip address:
> Before:
> {code}
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load       Tokens  Owns (effective)  Host ID                   
>             Rack
> UN  127.0.0.3  75.2 KiB   16      76.0%             
> 6d194555-f6eb-41d0-c000-0003  rack1
> UN  127.0.0.2  86.77 KiB  16      59.3%             
> 6d194555-f6eb-41d0-c000-0002  rack1
> UN  127.0.0.1  80.88 KiB  16      64.7%             
> 6d194555-f6eb-41d0-c000-0001  rack1
> {code}
> After:
> {code}
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                  
>              Rack
> UN  127.0.0.3  149.62 KiB  16      76.0%             
> 6d194555-f6eb-41d0-c000-0003  rack1
> UN  127.0.0.2  155.48 KiB  16      59.3%             
> 6d194555-f6eb-41d0-c000-0002  rack1
> UN  127.0.0.1  75.74 KiB   16      64.7%             
> 6d194555-f6eb-41d0-c000-0001  rack1
> {code}
> On previous tests of this I have created a table with a replication factor of 
> 1, inserted some data before the swap.   After the swap the data on nodes 2 
> and 3 is now missing. 
> One theory I have is that I am using different port numbers for the different 
> nodes, and I am only swapping the ip addresses and not the port numbers, so 
> the ip:port still looks unique
> i.e. 127.0.0.2:9043 becomes 127.0.0.2:9044
> and 127.0.0.3:9044 becomes 127.0.0.3:9043
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19221) CMS: Nodes can restart with new ipaddress already defined in the cluster

2024-04-17 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19221:

Attachment: ci_summary.html

> CMS: Nodes can restart with new ipaddress already defined in the cluster
> 
>
> Key: CASSANDRA-19221
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19221
> Project: Cassandra
>  Issue Type: Bug
>  Components: Transactional Cluster Metadata
>Reporter: Paul Chandler
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary.html
>
>
> I am simulating running a cluster in Kubernetes and testing what happens when 
> several pods go down and  ip addresses are swapped between nodes. In 4.0 this 
> is blocked and the node cannot be restarted.
> To simulate this I create a 3 node cluster on a local machine using 3 
> loopback addresses
> {code}
> 127.0.0.1
> 127.0.0.2
> 127.0.0.3
> {code}
> The nodes are created correctly and the first node is assigned as a CMS node 
> as shown:
> {code}
> bin/nodetool -p 7199 describecms
> {code}
> Cluster Metadata Service:
> {code}
> Members: /127.0.0.1:7000
> Is Member: true
> Service State: LOCAL
> {code}
> At this point I bring down the nodes 127.0.0.2 and 127.0.0.3 and swap the ip 
> addresses for the rpc_address and listen_address 
>  
> The nodes come back as normal, but the nodeid has now been swapped against 
> the ip address:
> Before:
> {code}
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load       Tokens  Owns (effective)  Host ID                   
>             Rack
> UN  127.0.0.3  75.2 KiB   16      76.0%             
> 6d194555-f6eb-41d0-c000-0003  rack1
> UN  127.0.0.2  86.77 KiB  16      59.3%             
> 6d194555-f6eb-41d0-c000-0002  rack1
> UN  127.0.0.1  80.88 KiB  16      64.7%             
> 6d194555-f6eb-41d0-c000-0001  rack1
> {code}
> After:
> {code}
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                  
>              Rack
> UN  127.0.0.3  149.62 KiB  16      76.0%             
> 6d194555-f6eb-41d0-c000-0003  rack1
> UN  127.0.0.2  155.48 KiB  16      59.3%             
> 6d194555-f6eb-41d0-c000-0002  rack1
> UN  127.0.0.1  75.74 KiB   16      64.7%             
> 6d194555-f6eb-41d0-c000-0001  rack1
> {code}
> On previous tests of this I have created a table with a replication factor of 
> 1, inserted some data before the swap.   After the swap the data on nodes 2 
> and 3 is now missing. 
> One theory I have is that I am using different port numbers for the different 
> nodes, and I am only swapping the ip addresses and not the port numbers, so 
> the ip:port still looks unique
> i.e. 127.0.0.2:9043 becomes 127.0.0.2:9044
> and 127.0.0.3:9044 becomes 127.0.0.3:9043
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19221) CMS: Nodes can restart with new ipaddress already defined in the cluster

2024-04-16 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837763#comment-17837763
 ] 

Alex Petrov commented on CASSANDRA-19221:
-

I've had a closer look at it, and wanted to mention that the 5.0 behaviour is 
most likely unintended; it contains at least one bug, and is potentially 
dangerous. In short, my test was to spin up a 3 node cluster: {{127.0.0.1}}, 
{{127.0.0.2}}, {{127.0.0.3}}, and swap IP addresses for the two latter nodes 
({{.2}} and {{.3}}). As a result of this test, the nodes have in fact swapped 
their IPs, but: 

  * if you shut down {{.2}} and {{.3}}, and start {{.2}} and then {{.3}}, the 
{{.3}} startup won't even begin because ccm considers its IP address to be 
occupied, so the entire test works only if you start the two nodes in parallel
  * after swapping IP addresses, ccm breaks, since it searches for the {{UP}} 
message for a specific IP address for a node, which it doesn't find if you 
merely change the address in the conf file
  * the peers table for {{.2}}, whose address is now {{.3}}, will still contain 
{{.3}}. 

In general, since we are using IP addresses for node identity, I am wary of 
allowing identity transfers for occupied pairs. By this I mean that if an {{ip 
<-> node id}} pair exists in the directory, we have to free up the IP address 
before the other node can claim it. So the test would look as follows: for 
swapping {{.2}} and {{.3}}, one of the nodes would have to migrate to {{.4}} 
first, and only then can the freed-up IP address be occupied again. 

Submitting a patch that fixes the peers table behaviour and codifies the 
requirement of a separate node for swapping addresses.

> CMS: Nodes can restart with new ipaddress already defined in the cluster
> 
>
> Key: CASSANDRA-19221
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19221
> Project: Cassandra
>  Issue Type: Bug
>  Components: Transactional Cluster Metadata
>Reporter: Paul Chandler
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.1-alpha1
>
>
> I am simulating running a cluster in Kubernetes and testing what happens when 
> several pods go down and  ip addresses are swapped between nodes. In 4.0 this 
> is blocked and the node cannot be restarted.
> To simulate this I create a 3 node cluster on a local machine using 3 
> loopback addresses
> {code}
> 127.0.0.1
> 127.0.0.2
> 127.0.0.3
> {code}
> The nodes are created correctly and the first node is assigned as a CMS node 
> as shown:
> {code}
> bin/nodetool -p 7199 describecms
> {code}
> Cluster Metadata Service:
> {code}
> Members: /127.0.0.1:7000
> Is Member: true
> Service State: LOCAL
> {code}
> At this point I bring down the nodes 127.0.0.2 and 127.0.0.3 and swap the ip 
> addresses for the rpc_address and listen_address 
>  
> The nodes come back as normal, but the nodeid has now been swapped against 
> the ip address:
> Before:
> {code}
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load       Tokens  Owns (effective)  Host ID                   
>             Rack
> UN  127.0.0.3  75.2 KiB   16      76.0%             
> 6d194555-f6eb-41d0-c000-0003  rack1
> UN  127.0.0.2  86.77 KiB  16      59.3%             
> 6d194555-f6eb-41d0-c000-0002  rack1
> UN  127.0.0.1  80.88 KiB  16      64.7%             
> 6d194555-f6eb-41d0-c000-0001  rack1
> {code}
> After:
> {code}
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                  
>              Rack
> UN  127.0.0.3  149.62 KiB  16      76.0%             
> 6d194555-f6eb-41d0-c000-0003  rack1
> UN  127.0.0.2  155.48 KiB  16      59.3%             
> 6d194555-f6eb-41d0-c000-0002  rack1
> UN  127.0.0.1  75.74 KiB   16      64.7%             
> 6d194555-f6eb-41d0-c000-0001  rack1
> {code}
> On previous tests of this I have created a table with a replication factor of 
> 1, inserted some data before the swap.   After the swap the data on nodes 2 
> and 3 is now missing. 
> One theory I have is that I am using different port numbers for the different 
> nodes, and I am only swapping the ip addresses and not the port numbers, so 
> the ip:port still looks unique
> i.e. 127.0.0.2:9043 becomes 127.0.0.2:9044
> and 127.0.0.3:9044 becomes 127.0.0.3:9043
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19221) CMS: Nodes can restart with new ipaddress already defined in the cluster

2024-04-16 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837763#comment-17837763
 ] 

Alex Petrov edited comment on CASSANDRA-19221 at 4/16/24 3:28 PM:
--

I've had a closer look at it, and wanted to mention that the 5.0 behaviour is 
most likely unintended; it contains at least one bug, and is potentially 
dangerous. In short, my test was to spin up a 3 node cluster: {{127.0.0.1}}, 
{{127.0.0.2}}, {{127.0.0.3}}, and swap IP addresses for the two latter nodes 
({{.2}} and {{.3}}). As a result of this test, the nodes have in fact swapped 
their IPs, but: 

  * if you shut down {{.2}} and {{.3}}, and start {{.2}} and then {{.3}}, the 
{{.3}} startup won't even begin because ccm considers its IP address to be 
occupied, so the entire test works only if you start the two nodes in parallel
  * after swapping IP addresses, ccm breaks, since it searches for the {{UP}} 
message for a specific IP address for a node, which it doesn't find if you 
merely change the address in the conf file
  * the peers table for {{.2}}, whose address is now {{.3}}, will still contain 
{{.3}}. 

In general, since we are using IP addresses for node identity, I am wary of 
allowing identity transfers for occupied pairs. By this I mean that if an {{ip 
<-> node id}} pair exists in the directory, we have to free up the IP address 
before the other node can claim it. So the test would look as follows: for 
swapping {{.2}} and {{.3}}, one of the nodes would have to migrate to {{.4}} 
first, and only then can the freed-up IP address be occupied again. 

Submitting a patch that fixes the peers table behaviour and codifies the 
requirement of a separate node for swapping addresses.


was (Author: ifesdjeen):
I've had a closer look at it, and wanted to mention that 5.0 behaviour is most 
likely uninteded; it contains at least one bug, and is potentially dangeroud. 
In short, my test was to spin up a 3 node cluster: {{127.0.0.1}}, 
{{127.0.0.2}}, {{127.0.0.3}}, and swap IP addresses for the two latter nodes 
({{.2}} and {{.3}}. As a result of this test, nodes have in fact swapped their 
IPs, but: 

  * if you would shut down {{.2}} and {{.3}}, and start {{.2}}, and then 
{{.3}}, {{.3}} startup won't even begin because ccm considers its IP address to 
be occupied, so an entire test can work only if you start the two nodes in 
parallel
  * after swapping ip addresses, ccm breaks, since it attempts to search {{UP}} 
message for a specific IP address for a node, which it doesn't find if you 
merely change the address in the conf file
  * peers table for {{.2}} whose address is now {{.3}} will still have {{.3}} 
in its peers table. 

In general, since we are using ip addresses for node identity, I am weary of 
allowing identity transfers for the occupied pars. By this I mean if {{ip <-> 
node id}} pair exists in the directory, we have to free up the IP address 
before the other node can claim it. So the test would look as follows:

So for swapping {{.2}} and {{.3}}, one of the nodes would have to migrate to 
{{.4}} first, and only then can the freed up IP address be occupied again. 

Submitting a patch that fixes the peers table behaviour and codifies a 
requirement of a separate node for swapping addresses.

> CMS: Nodes can restart with new ipaddress already defined in the cluster
> 
>
> Key: CASSANDRA-19221
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19221
> Project: Cassandra
>  Issue Type: Bug
>  Components: Transactional Cluster Metadata
>Reporter: Paul Chandler
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.1-alpha1
>
>
> I am simulating running a cluster in Kubernetes and testing what happens when 
> several pods go down and  ip addresses are swapped between nodes. In 4.0 this 
> is blocked and the node cannot be restarted.
> To simulate this I create a 3 node cluster on a local machine using 3 
> loopback addresses
> {code}
> 127.0.0.1
> 127.0.0.2
> 127.0.0.3
> {code}
> The nodes are created correctly and the first node is assigned as a CMS node 
> as shown:
> {code}
> bin/nodetool -p 7199 describecms
> {code}
> Cluster Metadata Service:
> {code}
> Members: /127.0.0.1:7000
> Is Member: true
> Service State: LOCAL
> {code}
> At this point I bring down the nodes 127.0.0.2 and 127.0.0.3 and swap the ip 
> addresses for the rpc_address and listen_address 
>  
> The nodes come back as normal, but the nodeid has now been swapped against 
> the ip address:
> Before:
> {code}
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load       Tokens  Owns (effective)  Host ID                   
>             Rack
> UN  127.0.0.3  75.2 KiB  

[jira] [Updated] (CASSANDRA-19128) The result of applying a metadata snapshot via ForceSnapshot should return the correct set of modified keys

2024-04-16 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19128:

Source Control Link: 
https://github.com/apache/cassandra/commit/7623e4678b8ef131434f1de3522c6425c092dff9
 Resolution: Fixed
 Status: Resolved  (was: Ready to Commit)

> The result of applying a metadata snapshot via ForceSnapshot should return 
> the correct set of modified keys
> ---
>
> Key: CASSANDRA-19128
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19128
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Cluster/Membership
>Reporter: Marcus Eriksson
>Assignee: Alex Petrov
>Priority: High
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-1.html, ci_summary.html
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> It should use the same logic as Transformer::build to compare the updated CM 
> with the previous to derive the modified keys



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19128) The result of applying a metadata snapshot via ForceSnapshot should return the correct set of modified keys

2024-04-16 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19128:

Reviewers: Marcus Eriksson
   Status: Review In Progress  (was: Patch Available)

> The result of applying a metadata snapshot via ForceSnapshot should return 
> the correct set of modified keys
> ---
>
> Key: CASSANDRA-19128
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19128
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Cluster/Membership
>Reporter: Marcus Eriksson
>Assignee: Alex Petrov
>Priority: High
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-1.html, ci_summary.html
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> It should use the same logic as Transformer::build to compare the updated CM 
> with the previous to derive the modified keys



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19128) The result of applying a metadata snapshot via ForceSnapshot should return the correct set of modified keys

2024-04-16 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837615#comment-17837615
 ] 

Alex Petrov commented on CASSANDRA-19128:
-

[~marcuse] left his +1 on the pull request. 

> The result of applying a metadata snapshot via ForceSnapshot should return 
> the correct set of modified keys
> ---
>
> Key: CASSANDRA-19128
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19128
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Cluster/Membership
>Reporter: Marcus Eriksson
>Assignee: Alex Petrov
>Priority: High
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-1.html, ci_summary.html
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> It should use the same logic as Transformer::build to compare the updated CM 
> with the previous to derive the modified keys



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19128) The result of applying a metadata snapshot via ForceSnapshot should return the correct set of modified keys

2024-04-16 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19128:

Status: Ready to Commit  (was: Review In Progress)

> The result of applying a metadata snapshot via ForceSnapshot should return 
> the correct set of modified keys
> ---
>
> Key: CASSANDRA-19128
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19128
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Cluster/Membership
>Reporter: Marcus Eriksson
>Assignee: Alex Petrov
>Priority: High
> Fix For: 5.1-alpha1
>
> Attachments: ci_summary-1.html, ci_summary.html
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> It should use the same logic as Transformer::build to compare the updated CM 
> with the previous to derive the modified keys



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-19344) Range movements involving transient replicas must safely enact changes to read and write replica sets

2024-04-16 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837573#comment-17837573
 ] 

Alex Petrov edited comment on CASSANDRA-19344 at 4/16/24 8:04 AM:
--

Wanted to point out a somewhat unintuitive albeit correct behaviour that 
involves Transient Replicas. I think it is worth talking through such things 
because pending ranges with transient replicas work slightly differently from 
their "normal" counterparts. 

We have a four node cluster with nodes 1,2,3,4 owning tokens 100,200,300,400, 
and 4 moving from 400 to 350.

Original/start state (READ/WRITE placements):

{code}
(400,MIN] -> [Full(/127.0.0.1:7012,(400,MIN]), 
Full(/127.0.0.2:7012,(400,MIN]), Transient(/127.0.0.3:7012,(400,MIN])]}
(MIN,100] -> [Full(/127.0.0.1:7012,(MIN,100]), 
Full(/127.0.0.2:7012,(MIN,100]), Transient(/127.0.0.3:7012,(MIN,100])]}
(100,200] -> [Full(/127.0.0.2:7012,(100,200]), 
Full(/127.0.0.3:7012,(100,200]), Transient(/127.0.0.4:7012,(100,200])]}
(200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.4:7012,(200,300]), Transient(/127.0.0.1:7012,(200,300])]}
(300,350] -> [Full(/127.0.0.1:7012,(300,350]), 
Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]}
(350,400] -> [Full(/127.0.0.4:7012,(350,400]), 
Full(/127.0.0.1:7012,(350,400]), Transient(/127.0.0.2:7012,(350,400])]}
{code}

State after {{START_MOVE}} (which is the point at which streaming starts, so 
think of additional replicas as pending), for WRITE placements:

{code}
(400,MIN] -> [Full(/127.0.0.1:7012,(400,MIN]), 
Full(/127.0.0.2:7012,(400,MIN]), Full(/127.0.0.3:7012,(400,MIN])]}
(MIN,100] -> [Full(/127.0.0.1:7012,(MIN,100]), 
Full(/127.0.0.2:7012,(MIN,100]), Full(/127.0.0.3:7012,(MIN,100])]}
(100,200] -> [Full(/127.0.0.2:7012,(100,200]), 
Full(/127.0.0.3:7012,(100,200]), Transient(/127.0.0.4:7012,(100,200]), 
Transient(/127.0.0.1:7012,(100,200])]}
(200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.4:7012,(200,300]), Full(/127.0.0.1:7012,(200,300])]}
(300,350] -> [Full(/127.0.0.1:7012,(300,350]), 
Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]}
(350,400] -> [Full(/127.0.0.4:7012,(350,400]), 
Full(/127.0.0.1:7012,(350,400]), Full(/127.0.0.2:7012,(350,400]), 
Transient(/127.0.0.3:7012,(350,400])]}
{code}

READ placements at the same moment:

{code}
(400,MIN] -> [Transient(/127.0.0.1:7012,(400,MIN]), 
Full(/127.0.0.2:7012,(400,MIN]), Full(/127.0.0.3:7012,(400,MIN])]}
(MIN,100] -> [Transient(/127.0.0.1:7012,(MIN,100]), 
Full(/127.0.0.2:7012,(MIN,100]), Full(/127.0.0.3:7012,(MIN,100])]}
(100,200] -> [Full(/127.0.0.2:7012,(100,200]), 
Full(/127.0.0.3:7012,(100,200]), Transient(/127.0.0.1:7012,(100,200])]}
(200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.1:7012,(200,300]), Transient(/127.0.0.4:7012,(200,300])]}
(300,350] -> [Full(/127.0.0.1:7012,(300,350]), 
Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]}
(350,400] -> [Full(/127.0.0.4:7012,(350,400]), 
Full(/127.0.0.2:7012,(350,400]), Transient(/127.0.0.3:7012,(350,400])]}
{code} 

Please note that READ placements are always a subset of WRITE ones (or, well, 
in a way: we can technically read from a full replica to satisfy a transient 
read). After FINISH_MOVE, we get for both READ and WRITE:

{code}
(400,MIN] -> [Full(/127.0.0.2:7012,(400,MIN]), 
Full(/127.0.0.3:7012,(400,MIN]), Transient(/127.0.0.1:7012,(400,MIN])]}
(MIN,200] -> [Full(/127.0.0.2:7012,(MIN,200]), 
Transient(/127.0.0.1:7012,(MIN,200]), Full(/127.0.0.3:7012,(MIN,200])]}
(200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.1:7012,(200,300]), Transient(/127.0.0.4:7012,(200,300])]}
(300,350] -> [Full(/127.0.0.1:7012,(300,350]), 
Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]}
(350,400] -> [Full(/127.0.0.4:7012,(350,400]), 
Full(/127.0.0.2:7012,(350,400]), Transient(/127.0.0.3:7012,(350,400])]} 
{code}

After executing START_MOVE, we get 3 full and no transient nodes for 
{{(200,300]}}. If we put transitions together, we see: 

{code}
1. (200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.4:7012,(200,300]), Transient(/127.0.0.1:7012,(200,300])]}
2. (200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.4:7012,(200,300]), Full(/127.0.0.1:7012,(200,300])]}
3. (200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.1:7012,(200,300]), Transient(/127.0.0.4:7012,(200,300])]}
{code}

In {{2.}}, you see that {{127.0.0.1}} went from transient to full, since it is 
now gaining a range, and should be a target for pending writes for this range. 
At the same time, it remains a _transient read replica_. In {{3.}}, 
{{127.0.0.4}} went from full to transient; it was kept full up till now since 
it was a streaming source, and to keep consistency levels correct, we 

What is unintuitive here 

[jira] [Commented] (CASSANDRA-19344) Range movements involving transient replicas must safely enact changes to read and write replica sets

2024-04-16 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837573#comment-17837573
 ] 

Alex Petrov commented on CASSANDRA-19344:
-

Wanted to point out a somewhat unintuitive albeit correct behaviour that 
involves Transient Replicas. I think it is worth talking through such things 
because pending ranges with transient replicas work slightly differently from 
their "normal" counterparts. 

We have a four node cluster with nodes 1,2,3,4 owning tokens 100,200,300,400, 
and 4 moving from 400 to 350.

Original/start state (READ/WRITE placements):

{code}
(400,MIN] -> [Full(/127.0.0.1:7012,(400,MIN]), 
Full(/127.0.0.2:7012,(400,MIN]), Transient(/127.0.0.3:7012,(400,MIN])]}
(MIN,100] -> [Full(/127.0.0.1:7012,(MIN,100]), 
Full(/127.0.0.2:7012,(MIN,100]), Transient(/127.0.0.3:7012,(MIN,100])]}
(100,200] -> [Full(/127.0.0.2:7012,(100,200]), 
Full(/127.0.0.3:7012,(100,200]), Transient(/127.0.0.4:7012,(100,200])]}
(200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.4:7012,(200,300]), Transient(/127.0.0.1:7012,(200,300])]}
(300,350] -> [Full(/127.0.0.1:7012,(300,350]), 
Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]}
(350,400] -> [Full(/127.0.0.4:7012,(350,400]), 
Full(/127.0.0.1:7012,(350,400]), Transient(/127.0.0.2:7012,(350,400])]}
{code}

State after {{START_MOVE}} (which is the point at which streaming starts, so 
think of additional replicas as pending), for WRITE placements:

{code}
(400,MIN] -> [Full(/127.0.0.1:7012,(400,MIN]), 
Full(/127.0.0.2:7012,(400,MIN]), Full(/127.0.0.3:7012,(400,MIN])]}
(MIN,100] -> [Full(/127.0.0.1:7012,(MIN,100]), 
Full(/127.0.0.2:7012,(MIN,100]), Full(/127.0.0.3:7012,(MIN,100])]}
(100,200] -> [Full(/127.0.0.2:7012,(100,200]), 
Full(/127.0.0.3:7012,(100,200]), Transient(/127.0.0.4:7012,(100,200]), 
Transient(/127.0.0.1:7012,(100,200])]}
(200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.4:7012,(200,300]), Full(/127.0.0.1:7012,(200,300])]}
(300,350] -> [Full(/127.0.0.1:7012,(300,350]), 
Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]}
(350,400] -> [Full(/127.0.0.4:7012,(350,400]), 
Full(/127.0.0.1:7012,(350,400]), Full(/127.0.0.2:7012,(350,400]), 
Transient(/127.0.0.3:7012,(350,400])]}
{code}

READ placements at the same moment:

{code}
(400,MIN] -> [Transient(/127.0.0.1:7012,(400,MIN]), 
Full(/127.0.0.2:7012,(400,MIN]), Full(/127.0.0.3:7012,(400,MIN])]}
(MIN,100] -> [Transient(/127.0.0.1:7012,(MIN,100]), 
Full(/127.0.0.2:7012,(MIN,100]), Full(/127.0.0.3:7012,(MIN,100])]}
(100,200] -> [Full(/127.0.0.2:7012,(100,200]), 
Full(/127.0.0.3:7012,(100,200]), Transient(/127.0.0.1:7012,(100,200])]}
(200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.1:7012,(200,300]), Transient(/127.0.0.4:7012,(200,300])]}
(300,350] -> [Full(/127.0.0.1:7012,(300,350]), 
Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]}
(350,400] -> [Full(/127.0.0.4:7012,(350,400]), 
Full(/127.0.0.2:7012,(350,400]), Transient(/127.0.0.3:7012,(350,400])]}
{code} 

Please note that READ placements are always a subset of WRITE ones (or, well, 
in a way: we can technically read from a full replica to satisfy a transient 
read). After FINISH_MOVE, we get, for both READ and WRITE:

{code}
(400,MIN] -> [Full(/127.0.0.2:7012,(400,MIN]), 
Full(/127.0.0.3:7012,(400,MIN]), Transient(/127.0.0.1:7012,(400,MIN])]}
(MIN,200] -> [Full(/127.0.0.2:7012,(MIN,200]), 
Transient(/127.0.0.1:7012,(MIN,200]), Full(/127.0.0.3:7012,(MIN,200])]}
(200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.1:7012,(200,300]), Transient(/127.0.0.4:7012,(200,300])]}
(300,350] -> [Full(/127.0.0.1:7012,(300,350]), 
Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]}
(350,400] -> [Full(/127.0.0.4:7012,(350,400]), 
Full(/127.0.0.2:7012,(350,400]), Transient(/127.0.0.3:7012,(350,400])]} 
{code}

After executing START_MOVE, we get 3 full and no transient nodes for 
{{(200,300]}}. If we put transitions together, we see: 

{code}
1. (200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.4:7012,(200,300]), Transient(/127.0.0.1:7012,(200,300])]}
2. (200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.4:7012,(200,300]), Full(/127.0.0.1:7012,(200,300])]}
3. (200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.1:7012,(200,300]), Transient(/127.0.0.4:7012,(200,300])]}
{code}

In {{2.}}, you see that {{127.0.0.1}} went from transient to full, since it is 
now gaining a range and should be a target for pending writes for this range. 
At the same time, it remains a _transient read replica_. In {{3.}}, 
{{127.0.0.4}} went from full to transient; it was kept full up until now since 
it was a streaming source, and to keep consistency levels correct, we 

What is unintuitive here is that usually, with replication factor of 3, we 
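
To make the relationship between the READ and WRITE placements above a bit more 
concrete, here is a minimal, self-contained sketch of the invariant being 
described. This is not Cassandra's actual replica/placement API: the class and 
method names are made up for illustration, and the sample data is the 
{{(200,300]}} placement after START_MOVE shown earlier.

{code}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (illustrative names only): every read replica must also be a write
// replica, and a node that reads as FULL must also write as FULL. The reverse
// is allowed: a transient read replica can temporarily be a full write target
// while it is gaining a range.
public class PlacementInvariant
{
    record Replica(String endpoint, boolean full) {}

    static boolean readsCoveredByWrites(List<Replica> reads, List<Replica> writes)
    {
        Map<String, Boolean> writeFull = new HashMap<>();
        for (Replica w : writes)
            writeFull.put(w.endpoint(), w.full());

        for (Replica r : reads)
        {
            Boolean full = writeFull.get(r.endpoint());
            if (full == null)
                return false; // reading from a node that is not a write replica
            if (r.full() && !full)
                return false; // full read backed only by a transient write
        }
        return true;
    }

    public static void main(String[] args)
    {
        // (200,300] after START_MOVE, taken from the placements above
        List<Replica> writes = List.of(new Replica("127.0.0.3", true),
                                       new Replica("127.0.0.4", true),
                                       new Replica("127.0.0.1", true));
        List<Replica> reads  = List.of(new Replica("127.0.0.3", true),
                                       new Replica("127.0.0.1", true),
                                       new Replica("127.0.0.4", false));
        System.out.println(readsCoveredByWrites(reads, writes)); // true
    }
}
{code}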

[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-15 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837419#comment-17837419
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

Sounds good, I'll tag you as soon as I have it up. Thank you [~rustyrazorblade]!

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> time out than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability

2024-04-15 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837416#comment-17837416
 ] 

Alex Petrov commented on CASSANDRA-19534:
-

[~rustyrazorblade] oh yes, that would exist on a single node as well. Think of 
a single node as the RF=1 case, where the coordinator and the replica are 
colocated. I have just finished the last wrinkle in my patch and now just need 
to rebase; I hope to post it ASAP. Hopefully it's not pressing, but I wanted to 
mention that unless you already have a patch for this, the quickest way is 
probably to check out what I have, as what you describe should be well covered.

> unbounded queues in native transport requests lead to node instability
> --
>
> Key: CASSANDRA-19534
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Local Write-Read Paths
>Reporter: Jon Haddad
>Assignee: Alex Petrov
>Priority: Normal
> Fix For: 5.0-rc, 5.x
>
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> time out than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 
> 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 
> --workload.rows=10 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                
>                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) | 
>   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 | 
>       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 | 
>       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 | 
>       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 | 
>       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 | 
>       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.
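
Purely to illustrate the bounded-queue / load-shedding idea from the 
description above (this is not the CASSANDRA-19534 patch; the class and method 
names below are hypothetical), a sketch could look like:

{code}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: a bounded admission queue that rejects new work when it
// is full, and skips work whose client deadline has already passed instead of
// executing it.
public class BoundedRequestQueue
{
    record Task(Runnable work, long deadlineNanos) {}

    private final BlockingQueue<Task> queue;

    public BoundedRequestQueue(int capacity)
    {
        queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Returns false (the request is shed immediately) when the queue is full. */
    public boolean tryEnqueue(Runnable work, long timeoutNanos)
    {
        return queue.offer(new Task(work, System.nanoTime() + timeoutNanos));
    }

    /** Worker loop body: drop expired tasks rather than doing work nobody awaits. */
    public void drainOne() throws InterruptedException
    {
        Task task = queue.take();
        if (System.nanoTime() - task.deadlineNanos() > 0)
            return; // the client has already timed out; executing now only adds load
        task.work().run();
    }
}
{code}

The essential difference from an unbounded queue is that an offer() against a 
fixed capacity fails fast under overload, whereas an unbounded queue keeps 
absorbing requests long after their client-side timeouts have expired, which 
matches the behaviour described above.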



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-12937) Default setting (yaml) for SSTable compression

2024-04-15 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837237#comment-17837237
 ] 

Alex Petrov commented on CASSANDRA-12937:
-

bq. Yes, I think this is the most ideal solution. If somebody wants to 
experiment with a new compressor and similar, there would need to be some knob 
to override it, like some JMX method or similar, and all risks attached to that 
(divergence of the configuration caused by operator's negligence) would be on 
him.

Some things are actually quite useful for gradual rollout. For example, 
compression. You probably do not want to rewrite your sstables across the 
entire cluster. Similar arguments may be made for canary deployments of 
memtable settings and other things. 

I agree that it is fine if these parameters are completely transient (i.e. if 
you have set one to something that diverges from the clusterwide value, it will 
get reverted after the node bounces). In that case, they will probably not go 
through TCM and will be purely node-local.

Examples of things that are now configurable via yaml but will be configurable 
via TCM if we go ahead with this proposal: partitioner, memtable configuration, 
default compaction strategy, compression. As Sam has mentioned, "which specific 
value makes it into schema just depends on which instance acts as the 
coordinator for a given DCL statement".

bq. but I remain unconvinced that just picking the defaults from whatever node 
happens to be coordinating is the right way to go.

I have talked with Sam briefly just to make sure I understand it correctly 
before trying to describe it. Since this was first worded in a way that 
suggested a problem but did not directly propose a solution (possibly described 
elsewhere), I will attempt to do so. Sam has already described a part of the 
solution as:

bq. That should probably be in a parallel local datastructure though, not in 
the node's local log table as we don't want to ship those local defaults to 
peers when providing log catchup (because they should use their own defaults).

The part that was missing for me was where the values would be coming from, and 
what the precedence would be. When executing a {{CREATE}} statement on some node 
_without_ specifying, say, compression, the statement will be created and 
executed without the value for compression set at all. Every node will pick the 
value from the ephemeral parallel structure Sam described (which is also 
settable via JMX and the like, as Stefan mentioned). If no value is present in 
this structure, it will be picked from yaml (alternatively, we could just 
populate this structure from yaml too, but I consider these roughly equivalent).
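
As a rough, purely illustrative sketch of that precedence (statement value, 
then node-local ephemeral override, then yaml), with all names hypothetical:

{code}
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical lookup order for a default such as compression:
// 1) value set explicitly on the CREATE statement / table schema,
// 2) node-local ephemeral override (e.g. set via JMX, lost on bounce),
// 3) the default from cassandra.yaml.
public class DefaultResolver<T>
{
    private final AtomicReference<T> nodeLocalOverride = new AtomicReference<>();
    private final T yamlDefault;

    public DefaultResolver(T yamlDefault)
    {
        this.yamlDefault = yamlDefault;
    }

    public void setNodeLocalOverride(T value)
    {
        nodeLocalOverride.set(value); // transient: not persisted, not shipped via TCM
    }

    public T resolve(Optional<T> statementValue)
    {
        if (statementValue.isPresent())
            return statementValue.get();
        T override = nodeLocalOverride.get();
        return override != null ? override : yamlDefault;
    }
}
{code}

With this shape, a {{CREATE}} that omits compression can legitimately resolve 
to different values on different nodes during a canary rollout, which is the 
gradual-rollout property mentioned above.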

> Default setting (yaml) for SSTable compression
> --
>
> Key: CASSANDRA-12937
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12937
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Local/Config
>Reporter: Michael Semb Wever
>Assignee: Stefan Miklosovic
>Priority: Low
>  Labels: AdventCalendar2021
> Fix For: 5.x
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> In many situations the choice of compression for sstables is more relevant to 
> the disks attached than to the schema and data.
> This issue is to add to cassandra.yaml a default value for sstable 
> compression that new tables will inherit (instead of the defaults found in 
> {{CompressionParams.DEFAULT}}.
> Examples where this can be relevant are filesystems that do on-the-fly 
> compression (btrfs, zfs) or specific disk configurations or even specific C* 
> versions (see CASSANDRA-10995 ).
> +Additional information for newcomers+
> Some new fields need to be added to {{cassandra.yaml}} to allow specifying 
> the field required for defining the default compression parameters. In 
> {{DatabaseDescriptor}} a new {{CompressionParams}} field should be added for 
> the default compression. This field should be initialized in 
> {{DatabaseDescriptor.applySimpleConfig()}}. At the different places where 
> {{CompressionParams.DEFAULT}} was used the code should call 
> {{DatabaseDescriptor#getDefaultCompressionParams}} that should return some 
> copy of configured {{CompressionParams}}.
> Some unit test using {{OverrideConfigurationLoader}} should be used to test 
> that the table schema use the new default when a new table is created (see 
> CreateTest for some example).
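
For readers picking this up, a very rough sketch of the shape those newcomer 
notes describe; the field, default values and method bodies here are simplified 
stand-ins rather than the actual Cassandra code:

{code}
// Simplified stand-in for the described change; real names and signatures differ.
public class DefaultCompressionConfig
{
    // Minimal stand-in for CompressionParams.
    public record CompressionParams(String compressorClass, int chunkLengthKiB) {}

    // Stand-in for what CompressionParams.DEFAULT provides today (values illustrative).
    private static volatile CompressionParams defaultCompression =
        new CompressionParams("LZ4Compressor", 16);

    // Would be invoked while applying cassandra.yaml (the ticket points at
    // DatabaseDescriptor.applySimpleConfig() as the real hook).
    public static void applyConfig(String compressorClass, int chunkLengthKiB)
    {
        if (compressorClass != null)
            defaultCompression = new CompressionParams(compressorClass, chunkLengthKiB);
    }

    // Call sites that used CompressionParams.DEFAULT would ask for this instead;
    // the ticket suggests returning a copy of the configured params.
    public static CompressionParams getDefaultCompressionParams()
    {
        return defaultCompression;
    }
}
{code}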



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-12937) Default setting (yaml) for SSTable compression

2024-04-15 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-12937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837237#comment-17837237
 ] 

Alex Petrov edited comment on CASSANDRA-12937 at 4/15/24 1:08 PM:
--

bq. Yes, I think this is the most ideal solution. If somebody wants to 
experiment with a new compressor and similar, there would need to be some knob 
to override it, like some JMX method or similar, and all risks attached to that 
(divergence of the configuration caused by operator's negligence) would be on 
him.

Some things are actually quite useful for gradual rollout. For example, 
compression. You probably do not want to rewrite your sstables across the 
entire cluster. Similar arguments may be made for canary deployments of 
memtable settings and other things. 

I agree that it is fine if these parameters are completely transient (i.e. if 
you have set one to something that diverges from the clusterwide value, it will 
get reverted after the node bounces). In that case, they will probably not go 
through TCM and will be purely node-local.

Examples of things that are now configurable via yaml but will be configurable 
via TCM if we go ahead with this proposal: partitioner, memtable configuration, 
default compaction strategy, compression. As Sam has mentioned, "which specific 
value makes it into schema just depends on which instance acts as the 
coordinator for a given DCL statement".

bq. but I remain unconvinced that just picking the defaults from whatever node 
happens to be coordinating is the right way to go.

I have talked with Sam briefly just to make sure I understand it correctly 
before trying to describe it. Since this was first worded in a way that 
suggested a problem but did not directly propose a solution (possibly described 
elsewhere), I will attempt to do so. Sam has already described a part of the 
solution as:

bq. That should probably be in a parallel local datastructure though, not in 
the node's local log table as we don't want to ship those local defaults to 
peers when providing log catchup (because they should use their own defaults).

The part that was missing for me was where the values would be coming from, and 
what the precedence would be. When executing a {{CREATE}} statement on some node 
_without_ specifying, say, compression, the statement will be created and 
executed without the value for compression set at all. Every node will pick the 
value from the ephemeral parallel structure Sam described (which is also 
settable via JMX and the like, as Stefan mentioned). If no value is present in 
this structure, it will be picked from yaml (alternatively, we could just 
populate this structure from yaml too, but I consider these roughly equivalent).


was (Author: ifesdjeen):
bq. Yes, I think this is the most ideal solution. If somebody wants to 
experiment with a new compressor and similar, there would need to be some knob 
to override it, like some JMX method or similar, and all risks attached to that 
(divergence of the configuration caused by operator's negligence) would be on 
him.

Some things are actually quite useful for gradual rollout. For example, 
compression. You probably do not want to rewrite your sstables across the 
entire cluster. Similar arguments may be made for canary deployments of 
memtable settings and other things. 

I agree that it is fine if these parameters are completely transient (i.e. if 
you have set it to something that diverges from the clusterwide value, it will 
get reverted back after the node bounce). In such case, probably they will not 
go through TCM and will be purely node-local.

Examples of things that are now configuable via yaml but will be configurable 
via TCM if we go ahead with this proposal: partitioner, memtable configuration, 
default compaction strategy, compression. As Sam has mentioned, "which specific 
value makes it into schema just depends on which instance acts as the 
coordinator for a given DCL statement".

bq. but I remain unconvinced that just picking the defaults from whatever node 
happens to be coordinating is the right way to go.

I have talked with Sam shortly just to make sure I understand it correctly 
before trying to describe it. Since this was first worded in a way that 
suggested a problem but not directly proposed a solution (possibly described 
elsewhere), I will attempt to do this. Sam has already described a part of the 
solution as:

bq. That should probably be in a parallel local datastructure though, not in 
the node's local log table as we don't want to ship those local defaults to 
peers when providing log catchup (because they should use their own defaults).

The part that was missing for me was where would the values be coming from, and 
what would be the precedence. When executing a {CREATE} statement on some node 
_without_ specifying, say, compression, the statement will be created and 

[jira] [Updated] (CASSANDRA-19517) Raise priority of TCM internode messages during critical operations

2024-04-11 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19517:

Test and Documentation Plan: Includes tests. Additional stress testing will 
be done during release qualification.
 Status: Patch Available  (was: Open)

> Raise priority of TCM internode messages during critical operations
> ---
>
> Key: CASSANDRA-19517
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19517
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary.html, result_details.tar.gz
>
>
> In a busy cluster, TCM messages may not get propagated throughout the 
> cluster, since they will be ordered together with other P1 messages (for 
> {{TCM_}}-prefixed verbs), and with P2 messages for all Paxos operations.
> To avoid this, and make sure we can continue cluster metadata changes, all 
> {{TCM_}}-prefixed verbs should have {{P0}} priority, just like Gossip 
> messages used to. All Paxos messages that involve distributed metadata 
> keyspace should now get an {{URGENT}} flag, which will instruct internode 
> messaging to schedule them on the {{URGENT_MESSAGES}} connection.
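
As a hedged illustration of the routing behaviour described here (this is not 
the actual patch; the enum and the size threshold below are simplified 
stand-ins, although the {{URGENT_MESSAGES}} connection itself is named in the 
ticket):

{code}
// Illustrative sketch: decide which outbound connection a verb's messages use,
// so that TCM_* verbs and urgent Paxos messages (distributed metadata keyspace)
// are not queued behind regular-priority traffic.
public class ConnectionRouting
{
    enum ConnectionType { URGENT_MESSAGES, SMALL_MESSAGES, LARGE_MESSAGES }

    static ConnectionType connectionFor(String verb, boolean urgent, int payloadBytes)
    {
        if (verb.startsWith("TCM_") || urgent) // e.g. Paxos ops on cluster metadata
            return ConnectionType.URGENT_MESSAGES;
        // the threshold is illustrative, not Cassandra's actual large-message cutoff
        return payloadBytes <= 64 * 1024 ? ConnectionType.SMALL_MESSAGES
                                         : ConnectionType.LARGE_MESSAGES;
    }
}
{code}

The essential property is simply that urgency is decided per verb (or per 
message flag) before anything is queued, so saturated regular connections 
cannot delay cluster metadata changes.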



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-19517) Raise priority of TCM internode messages during critical operations

2024-04-11 Thread Alex Petrov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-19517:

Attachment: result_details.tar.gz

> Raise priority of TCM internode messages during critical operations
> ---
>
> Key: CASSANDRA-19517
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19517
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Transactional Cluster Metadata
>Reporter: Alex Petrov
>Assignee: Alex Petrov
>Priority: Normal
> Attachments: ci_summary.html, result_details.tar.gz
>
>
> In a busy cluster, TCM messages may not get propagated throughout the 
> cluster, since they will be ordered together with other P1 messages (for 
> {{TCM_}}-prefixed verbs), and with P2 messages for all Paxos operations.
> To avoid this, and make sure we can continue cluster metadata changes, all 
> {{TCM_}}-prefixed verbs should have {{P0}} priority, just like Gossip 
> messages used to. All Paxos messages that involve distributed metadata 
> keyspace should now get an {{URGENT}} flag, which will instruct internode 
> messaging to schedule them on the {{URGENT_MESSAGES}} connection.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org


