[jira] [Commented] (CASSANDRA-19693) Relax slow_query_log_timeout for MultiNodeSAITest
[ https://issues.apache.org/jira/browse/CASSANDRA-19693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853898#comment-17853898 ]

Alex Petrov commented on CASSANDRA-19693:
-----------------------------------------

+1, LGTM. Thank you for the patch!

> Relax slow_query_log_timeout for MultiNodeSAITest
> -------------------------------------------------
>
>                 Key: CASSANDRA-19693
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19693
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Feature/SAI, Test/fuzz
>            Reporter: Caleb Rackliffe
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>             Fix For: 5.x
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> To stress the paging subsystem, we intentionally use a comically low fetch
> size in {{MultiNodeSAITest}}. This can lead to some very slow queries when we
> get matches into the hundreds of rows. It looks like CASSANDRA-19534 has
> gotten a little more aggressive about how the slow query timeout is
> triggered, and there's a lot of noise around this in the logs, even in local
> runs. I think bumping the default slow query timeout, and perhaps the native
> protocol timeout, a bit should clear this up.
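A deliberately small driver page size is what produces the paging pressure described above: every few rows costs another paging round-trip. A rough sketch with the DataStax Java driver 4.x — the keyspace, table, query, and page size of 2 here are illustrative, not the values the test actually uses:

{code:java}
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class SmallPageSizeExample
{
    public static void main(String[] args)
    {
        try (CqlSession session = CqlSession.builder().build())
        {
            // A tiny page size forces a paging round-trip every couple of
            // rows, stressing the paging subsystem the way the test intends.
            SimpleStatement stmt =
                    SimpleStatement.newInstance("SELECT * FROM ks.tbl")
                                   .setPageSize(2);
            session.execute(stmt).forEach(row -> { /* consume rows */ });
        }
    }
}
{code}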
[jira] [Updated] (CASSANDRA-19695) Accord Journal Simulation: Add instrumentation for Semaphore
[ https://issues.apache.org/jira/browse/CASSANDRA-19695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19695:
------------------------------------
    Test and Documentation Plan: Includes a test
                         Status: Patch Available  (was: Open)

> Accord Journal Simulation: Add instrumentation for Semaphore
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-19695
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19695
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Accord
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
[jira] [Updated] (CASSANDRA-19695) Accord Journal Simulation: Add instrumentation for Semaphore
[ https://issues.apache.org/jira/browse/CASSANDRA-19695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19695:
------------------------------------
     Bug Category: Parent values: Code(13163), Level 1 values: Bug - Unclear Impact(13164)
       Complexity: Normal
      Component/s: Accord
    Discovered By: Code Inspection
         Severity: Low
           Status: Open  (was: Triage Needed)

> Accord Journal Simulation: Add instrumentation for Semaphore
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-19695
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19695
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Accord
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
[jira] [Created] (CASSANDRA-19695) Accord Journal Simulation: Add instrumentation for Semaphore
Alex Petrov created CASSANDRA-19695:
---------------------------------------

             Summary: Accord Journal Simulation: Add instrumentation for Semaphore
                 Key: CASSANDRA-19695
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19695
             Project: Cassandra
          Issue Type: Bug
            Reporter: Alex Petrov
[jira] [Updated] (CASSANDRA-19694) Make Accord timestamps strictly monotonic
[ https://issues.apache.org/jira/browse/CASSANDRA-19694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19694:
------------------------------------
    Test and Documentation Plan: Covered by existing tests
                         Status: Patch Available  (was: Open)

> Make Accord timestamps strictly monotonic
> ------------------------------------------
>
>                 Key: CASSANDRA-19694
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19694
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Accord
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
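The ticket title names a standard technique. As a generic sketch only — not the actual patch, and the class and method names below are invented — strictly monotonic timestamps are usually issued by never handing out the same value twice, even when the wall clock stalls or steps backwards:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

public final class MonotonicMicros
{
    private static final AtomicLong LAST = new AtomicLong();

    // Returns a strictly increasing value: at least one more than the
    // previously issued timestamp, and never behind the current clock.
    public static long nextMicros()
    {
        long nowMicros = System.currentTimeMillis() * 1000;
        return LAST.updateAndGet(prev -> Math.max(prev + 1, nowMicros));
    }
}
{code}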
[jira] [Assigned] (CASSANDRA-19694) Make Accord timestamps strictly monotonic
[ https://issues.apache.org/jira/browse/CASSANDRA-19694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov reassigned CASSANDRA-19694:
---------------------------------------
    Assignee: Alex Petrov

> Make Accord timestamps strictly monotonic
> ------------------------------------------
>
>                 Key: CASSANDRA-19694
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19694
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Accord
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
[jira] [Updated] (CASSANDRA-19694) Make Accord timestamps strictly monotonic
[ https://issues.apache.org/jira/browse/CASSANDRA-19694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19694:
------------------------------------
     Bug Category: Parent values: Correctness(12982), Level 1 values: Unrecoverable Corruption / Loss(13161)
       Complexity: Low Hanging Fruit
    Discovered By: Code Inspection
         Severity: Critical
           Status: Open  (was: Triage Needed)

> Make Accord timestamps strictly monotonic
> ------------------------------------------
>
>                 Key: CASSANDRA-19694
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19694
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Accord
>            Reporter: Alex Petrov
>            Priority: Normal
[jira] [Created] (CASSANDRA-19694) Make Accord timestamps strictly monotonic
Alex Petrov created CASSANDRA-19694:
---------------------------------------

             Summary: Make Accord timestamps strictly monotonic
                 Key: CASSANDRA-19694
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19694
             Project: Cassandra
          Issue Type: Bug
          Components: Accord
            Reporter: Alex Petrov
[jira] [Updated] (CASSANDRA-19662) Data Corruption and OOM Issues During Schema Alterations
[ https://issues.apache.org/jira/browse/CASSANDRA-19662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19662:
------------------------------------
    Component/s: Cluster/Schema  (was: Client/java-driver)

> Data Corruption and OOM Issues During Schema Alterations
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-19662
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19662
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Schema
>            Reporter: BHARATH KUMAR
>            Priority: Urgent
>         Attachments: BufferUnderflow_plus_error
>
> h2. Description
>
> *Overview:* The primary issue is data corruption occurring during schema
> alterations (ADD/DROP column) on large tables (300+ columns, 6 TB) in the
> production cluster. It is accompanied by out-of-memory (OOM) errors and
> other exceptions, specifically during batch reads. The problem has been
> replicated on multiple clusters running Apache Cassandra 4.0.12 and the
> DataStax Java driver 4.17.
>
> *Details:*
>
> *Main Issue:*
> * *Data Corruption:* When dynamically adding a column to a table, the data
> intended for the new column is shifted, causing misalignment in the data.
> * *Symptoms:* The object implementing
> {{com.datastax.oss.driver.api.core.cql.Row}} returns values shifted against
> the column names returned by {{row.getColumnDefinitions()}}. The driver
> returns a corrupted row, leading to incorrect data insertion.
>
> *Additional Issues:*
>
> *Exceptions:*
> * {{java.nio.BufferUnderflowException}} during batch reads when ALTER TABLE
> ADD/DROP column statements are issued.
> * {{java.lang.ArrayIndexOutOfBoundsException}} in some cases.
> * Buffer underflow exceptions with messages like "Invalid 32-bits integer
> value, expecting 4 bytes but got 292".
> * OOM errors mostly occur during ADD column operations, while the other
> exceptions occur during DROP column operations.
> * *Method Specific:* Errors occur specifically with
> {{row.getList(columnName, Float.class)}}, returning incorrect values.
>
> *Reproducibility:*
> * The issue is reproducible on larger tables (300 columns, 6 TB) but not on
> smaller tables.
> * SELECT * statements are used during reads.
> * *Method Specific:* Errors occur specifically with
> {{row.getList(columnName, Float.class)}}, returning incorrect values.
> However, the code registers a driver exception when calling
> {{row.getList(columnName, Float.class)}}. We pass the exact column name
> obtained from {{row.getColumnDefinition}}, but it returns the wrong value
> for a column with this name. This suggests that the issue lies with the
> driver returning an object with incorrect properties, rather than with the
> CQL query itself.
>
> *Debugging Efforts:*
> * *Metadata Refresh:* Enabling metadata refresh did not resolve the issue.
> * *Schema Agreement:* {{session.getCqlSession().checkSchemaAgreement()}} did
> not detect inconsistencies during test execution.
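One way to surface the misalignment described above is to decode every column of every row under the type the driver itself reports for it. A hypothetical check — not from the ticket, using only documented driver 4.x calls:

{code:java}
import com.datastax.oss.driver.api.core.cql.ColumnDefinition;
import com.datastax.oss.driver.api.core.cql.Row;

public final class RowSanityCheck
{
    // Decodes each column with the codec implied by its reported type; a
    // BufferUnderflowException, or a value of the wrong shape, is the
    // shifted-row symptom the reporter describes.
    public static void check(Row row)
    {
        for (ColumnDefinition def : row.getColumnDefinitions())
        {
            try
            {
                Object value = row.getObject(def.getName());
                System.out.println(def.getName() + " = " + value);
            }
            catch (RuntimeException e)
            {
                System.err.println("Column " + def.getName() + " failed to decode: " + e);
            }
        }
    }
}
{code}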
[jira] [Updated] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability
[ https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19664:
------------------------------------
          Fix Version/s: 5.1-alpha1
          Since Version: 5.1-alpha1
    Source Control Link: https://github.com/apache/cassandra/commit/b0ca509e7add760d187fcc5a9908d93d7c4fd6ec
             Resolution: Fixed
                 Status: Resolved  (was: Ready to Commit)

> Accord Journal Determinism: PreAccept replay stability
> -------------------------------------------------------
>
>                 Key: CASSANDRA-19664
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Accord
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 5.1-alpha1
>
>         Attachments: ci_summary-1.html, ci_summary.html
>
> Currently, some messages, such as PreAccept, can have some of their context
> initialized on replay. This patch adds a concept of Context to the Journal
> that can be used to carry arbitrary information necessary for replaying such
> messages exactly the way they were executed the first time.
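A rough sketch of the idea — the names below are invented for illustration and are not the Accord API: capture an opaque context blob next to each journal record at first execution, and hand the same blob back to the message on replay so it sees identical state:

{code:java}
// Illustrative only — not the committed interface.
public interface ReplayContext
{
    byte[] serialize();
}

public final class JournalRecord
{
    public final byte[] message; // e.g. a serialized PreAccept
    public final byte[] context; // captured on first execution, reused verbatim on replay

    public JournalRecord(byte[] message, ReplayContext context)
    {
        this.message = message;
        this.context = context.serialize();
    }
}
{code}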
[jira] [Updated] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability
[ https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19664:
------------------------------------
    Status: Ready to Commit  (was: Review In Progress)

Based on Aleksey's +1 on both patches, merging.

> Accord Journal Determinism: PreAccept replay stability
> -------------------------------------------------------
>
>                 Key: CASSANDRA-19664
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Accord
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
>         Attachments: ci_summary-1.html, ci_summary.html
[jira] [Comment Edited] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability
[ https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850974#comment-17850974 ]

Alex Petrov edited comment on CASSANDRA-19664 at 5/31/24 9:39 AM:
------------------------------------------------------------------

[~aleksey] uploaded the latest CI run; there are some JDK17 failures that seem
to be related to {{add-opens}}; three dtest failures are unrelated.

was (Author: ifesdjeen):
[~aleksey] uploaded the latest CI run; there are some JDK17 failures that seem
to be related to {add-opens}; three dtest failures are unrelated.

> Accord Journal Determinism: PreAccept replay stability
> -------------------------------------------------------
>
>                 Key: CASSANDRA-19664
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Accord
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
>         Attachments: ci_summary-1.html, ci_summary.html
[jira] [Updated] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time
[ https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19215:
------------------------------------
    Status: Open  (was: Patch Available)

> "Query start time" in native transport request threads should be the task enqueue time
> ---------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19215
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Runtian Liu
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>         Attachments: ci_summary.html, result_details.tar.gz
>
> Recently, our Cassandra 4.0.6 cluster experienced an outage due to a surge in
> expensive traffic from the application side. This surge involved a large
> volume of costly read queries, which took a considerable amount of time to
> process on the server side. The client had timeout settings; if a request
> timed out, it might trigger the sending of new requests. Since the server
> nodes were overloaded, numerous nodes had hundreds of thousands of tasks
> queued in the Native-Transport-Request pending queue. I expected that once
> the application ceased sending requests, the server nodes would quickly
> return to normal, as most requests in the queue were over half an hour old
> and should have timed out rapidly, clearing the queue. However, it actually
> took an hour to clear the native transport's pending queue, even with native
> transport disabled. Upon examining the code, I noticed that for read/write
> requests, the
> [queryStartNanoTime|https://github.com/apache/cassandra/blob/cassandra-4.0/src/java/org/apache/cassandra/transport/Dispatcher.java#L78],
> which determines whether a request has timed out, is only captured when the
> task starts processing. This means that no matter how long a request has
> been pending, the time spent queued doesn't count toward the timeout. I
> believe this is incorrect. The timer should start when the Cassandra server
> receives the request or when it enqueues the task, not when the request/task
> begins processing. This way, an overloaded node with many pending tasks can
> quickly discard timed-out requests and recover from an outage once new
> requests stop.
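A minimal sketch of the reporter's proposal — the names are invented, and this is not Cassandra's actual Dispatcher: stamp each task at enqueue time and shed it if the timeout has already elapsed by the time a worker picks it up:

{code:java}
import java.util.concurrent.TimeUnit;

public final class TimedTask implements Runnable
{
    private final long enqueuedAtNanos = System.nanoTime(); // stamped on admission, not on execution
    private final Runnable work;
    private final long timeoutNanos;

    public TimedTask(Runnable work, long timeoutMillis)
    {
        this.work = work;
        this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    }

    @Override
    public void run()
    {
        // If the request already exceeded its timeout while queued, drop it:
        // the client has long since given up, so executing it is wasted work.
        if (System.nanoTime() - enqueuedAtNanos > timeoutNanos)
            return;
        work.run();
    }
}
{code}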
[jira] [Updated] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time
[ https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19215:
------------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Open)

> "Query start time" in native transport request threads should be the task enqueue time
> ---------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19215
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Runtian Liu
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>         Attachments: ci_summary.html, result_details.tar.gz
[jira] [Assigned] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time
[ https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov reassigned CASSANDRA-19215:
---------------------------------------
    Assignee: Alex Petrov

> "Query start time" in native transport request threads should be the task enqueue time
> ---------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19215
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Runtian Liu
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>         Attachments: ci_summary.html, result_details.tar.gz
[jira] [Commented] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time
[ https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851018#comment-17851018 ]

Alex Petrov commented on CASSANDRA-19215:
-----------------------------------------

This should be fixed by [CASSANDRA-19534].

> "Query start time" in native transport request threads should be the task enqueue time
> ---------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19215
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Runtian Liu
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>         Attachments: ci_summary.html, result_details.tar.gz
[jira] [Updated] (CASSANDRA-19534) Unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19534:
------------------------------------
    Since Version: 3.0.0  (was: 4.1.5)

> Unbounded queues in native transport requests lead to node instability
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-19534
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Jon Haddad
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 4.1.x, 5.0-rc, 5.x
>
>         Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - QUEUE.jpg,
>                      Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg,
>                      Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary-4.1.html,
>                      ci_summary-5.0.html, ci_summary-trunk.html, ci_summary.html,
>                      image-2024-05-03-16-08-10-101.png, screenshot-1.png, screenshot-2.png,
>                      screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png,
>                      screenshot-7.png, screenshot-8.png, screenshot-9.png
>
>          Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up
> in the native transport queue, and it looks like it can take far longer to
> time out than is configured. We should be shedding load much more
> aggressively and use a bounded queue for incoming work. This is extremely
> evident when we combine a resource-consuming workload with a smaller one.
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100 -r 1 --workload.rows=10 --workload.select=partition --maxrlat 100 --populate 10m --rate 50k -n 1
>
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100 -r 1 --workload.rows=10 --workload.select=partition --rate 200 -d 1d
>
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h
> {noformat}
> It appears our results don't time out at the requested server time either:
> {noformat}
>             Writes                               Reads                               Deletes                   Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |  Count  Latency (p99)  1min (req/s) |    Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 |      0              0             0 |  9580484         18980.45
>  952304       70567.62        640.1  |  791072       70634.34        428.36 |      0              0             0 |  9636658         18969.54
>  953146       70767.34        640.1  |  791400       70767.76        428.36 |      0              0             0 |  9695272         18969.54
>  956833       71171.28        623.14 |  794009       71175.6         412.79 |      0              0             0 |  9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 |      0              0             0 |  9804907         18943.11
> {noformat}
> After stopping the load test altogether, it took nearly a minute before the
> requests were no longer queued.
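For illustration, the generic shape of the remedy the description argues for — a sketch, not the committed patch: a bounded work queue with fail-fast rejection, so excess requests are shed instead of piling up as a stale backlog:

{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public final class BoundedExecutorSketch
{
    public static ThreadPoolExecutor create(int threads, int maxQueued)
    {
        // AbortPolicy throws RejectedExecutionException once the queue is
        // full, so overload surfaces immediately instead of accumulating
        // hundreds of thousands of tasks that will time out anyway.
        return new ThreadPoolExecutor(threads, threads,
                                      0L, TimeUnit.MILLISECONDS,
                                      new ArrayBlockingQueue<>(maxQueued),
                                      new ThreadPoolExecutor.AbortPolicy());
    }
}
{code}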
[jira] [Updated] (CASSANDRA-19534) Unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19534:
------------------------------------
          Since Version: 4.1.5
    Source Control Link: https://github.com/apache/cassandra/commit/dc17c29724d86547538cc8116ff1a90d36a0bf3a
             Resolution: Fixed
                 Status: Resolved  (was: Ready to Commit)

Committed to 4.1 with [dc17c29724d86547538cc8116ff1a90d36a0bf3a|https://github.com/apache/cassandra/commit/dc17c29724d86547538cc8116ff1a90d36a0bf3a] and merged up to [5.0|https://github.com/apache/cassandra/commit/617a75843c9bfaf241249514f9604466f6c8ccab] and [trunk|https://github.com/apache/cassandra/commit/d10008d54bfb301ba12d022037b1caf78f18418b].

> Unbounded queues in native transport requests lead to node instability
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-19534
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Jon Haddad
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 4.1.x, 5.0-rc, 5.x
>
>         Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - QUEUE.jpg,
>                      Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg,
>                      Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary-4.1.html,
>                      ci_summary-5.0.html, ci_summary-trunk.html, ci_summary.html,
>                      image-2024-05-03-16-08-10-101.png, screenshot-1.png, screenshot-2.png,
>                      screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png,
>                      screenshot-7.png, screenshot-8.png, screenshot-9.png
>
>          Time Spent: 9h 50m
>  Remaining Estimate: 0h
[jira] [Updated] (CASSANDRA-19534) Unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19534:
------------------------------------
    Status: Ready to Commit  (was: Review In Progress)

> Unbounded queues in native transport requests lead to node instability
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-19534
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Jon Haddad
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 4.1.x, 5.0-rc, 5.x
>
>         Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - QUEUE.jpg,
>                      Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg,
>                      Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary-4.1.html,
>                      ci_summary-5.0.html, ci_summary-trunk.html, ci_summary.html,
>                      image-2024-05-03-16-08-10-101.png, screenshot-1.png, screenshot-2.png,
>                      screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png,
>                      screenshot-7.png, screenshot-8.png, screenshot-9.png
>
>          Time Spent: 9h 50m
>  Remaining Estimate: 0h
[jira] [Updated] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability
[ https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19664:
------------------------------------
    Attachment: ci_summary-1.html

> Accord Journal Determinism: PreAccept replay stability
> -------------------------------------------------------
>
>                 Key: CASSANDRA-19664
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Accord
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
>         Attachments: ci_summary-1.html, ci_summary.html
[jira] [Commented] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability
[ https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850974#comment-17850974 ]

Alex Petrov commented on CASSANDRA-19664:
-----------------------------------------

[~aleksey] uploaded the latest CI run; there are some JDK17 failures that seem
to be related to {add-opens}; three dtest failures are unrelated.

> Accord Journal Determinism: PreAccept replay stability
> -------------------------------------------------------
>
>                 Key: CASSANDRA-19664
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Accord
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
>         Attachments: ci_summary-1.html, ci_summary.html
[jira] [Updated] (CASSANDRA-19534) Unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19534:
------------------------------------
    Summary: Unbounded queues in native transport requests lead to node instability  (was: unbounded queues in native transport requests lead to node instability)

> Unbounded queues in native transport requests lead to node instability
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-19534
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Jon Haddad
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 4.1.x, 5.0-rc, 5.x
>
>         Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - QUEUE.jpg,
>                      Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg,
>                      Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary-4.1.html,
>                      ci_summary-5.0.html, ci_summary-trunk.html, ci_summary.html,
>                      image-2024-05-03-16-08-10-101.png, screenshot-1.png, screenshot-2.png,
>                      screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png,
>                      screenshot-7.png, screenshot-8.png, screenshot-9.png
>
>          Time Spent: 9h 50m
>  Remaining Estimate: 0h
[jira] [Updated] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability
[ https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19664:
------------------------------------
    Attachment: ci_summary.html

> Accord Journal Determinism: PreAccept replay stability
> -------------------------------------------------------
>
>                 Key: CASSANDRA-19664
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Accord
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
>         Attachments: ci_summary.html
[jira] [Updated] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability
[ https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19664:
------------------------------------
    Reviewers: Aleksey Yeschenko, Alex Petrov
       Status: Review In Progress  (was: Patch Available)

> Accord Journal Determinism: PreAccept replay stability
> -------------------------------------------------------
>
>                 Key: CASSANDRA-19664
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Accord
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
[jira] [Updated] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability
[ https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19664:
------------------------------------
    Test and Documentation Plan: Covered by existing tests in part; more tests coming with a follow-up patch
                         Status: Patch Available  (was: Open)

> Accord Journal Determinism: PreAccept replay stability
> -------------------------------------------------------
>
>                 Key: CASSANDRA-19664
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Accord
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
[jira] [Updated] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability
[ https://issues.apache.org/jira/browse/CASSANDRA-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19664:
------------------------------------
     Bug Category: Parent values: Correctness(12982), Level 1 values: Unrecoverable Corruption / Loss(13161)
       Complexity: Normal
      Component/s: Accord
    Discovered By: Code Inspection
         Severity: Critical
           Status: Open  (was: Triage Needed)

> Accord Journal Determinism: PreAccept replay stability
> -------------------------------------------------------
>
>                 Key: CASSANDRA-19664
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19664
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Accord
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
[jira] [Commented] (CASSANDRA-19662) Data Corruption and OOM Issues During Schema Alterations
[ https://issues.apache.org/jira/browse/CASSANDRA-19662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850061#comment-17850061 ]

Alex Petrov commented on CASSANDRA-19662:
-----------------------------------------

[~kumarbharath] which Cassandra version are you using?

> Data Corruption and OOM Issues During Schema Alterations
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-19662
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19662
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Client/java-driver
>            Reporter: BHARATH KUMAR
>            Priority: Urgent
>         Attachments: BufferUnderflow_plus_error
[jira] [Commented] (CASSANDRA-19663) trunk fails to start
[ https://issues.apache.org/jira/browse/CASSANDRA-19663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850060#comment-17850060 ]

Alex Petrov commented on CASSANDRA-19663:
-----------------------------------------

> Is something else needed for 5.1?

There should not be anything different needed for trunk. It also seems to
build on CI as recently as today.

> trunk fails to start
> --------------------
>
>                 Key: CASSANDRA-19663
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19663
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jon Haddad
>            Priority: Normal
>
> On commit {{6701259bce91672a7c3ca9fb77ea7b040e9c}}, I get errors on startup.
> Verified the build was successful:
> {noformat}
> easy-cass-lab.amazon-ebs.ubuntu: BUILD SUCCESSFUL
> easy-cass-lab.amazon-ebs.ubuntu: Total time: 1 minute 41 seconds
> {noformat}
> Running on a new Ubuntu instance:
> {noformat}
> INFO  [main] 2024-05-24 18:31:16,397 YamlConfigurationLoader.java:103 - Configuration location: file:/usr/local/cassandra/trunk/conf/cassandra.yaml
> ERROR [main] 2024-05-24 18:31:16,470 CassandraDaemon.java:900 - Exception encountered during startup
> java.lang.NoSuchMethodError: 'void org.yaml.snakeyaml.LoaderOptions.setCodePointLimit(int)'
>         at org.apache.cassandra.config.YamlConfigurationLoader.getDefaultLoaderOptions(YamlConfigurationLoader.java:433)
>         at org.apache.cassandra.config.YamlConfigurationLoader$CustomConstructor.<init>(YamlConfigurationLoader.java:278)
>         at org.apache.cassandra.config.YamlConfigurationLoader.loadConfig(YamlConfigurationLoader.java:135)
>         at org.apache.cassandra.config.YamlConfigurationLoader.loadConfig(YamlConfigurationLoader.java:116)
>         at org.apache.cassandra.config.DatabaseDescriptor.loadConfig(DatabaseDescriptor.java:403)
>         at org.apache.cassandra.config.DatabaseDescriptor.daemonInitialization(DatabaseDescriptor.java:265)
>         at org.apache.cassandra.config.DatabaseDescriptor.daemonInitialization(DatabaseDescriptor.java:250)
>         at org.apache.cassandra.service.CassandraDaemon.applyConfig(CassandraDaemon.java:781)
>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:724)
>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:878)
> {noformat}
> Running on Java 17:
> {noformat}
> ubuntu@cassandra0:~$ java -version
> openjdk version "17.0.10" 2024-01-16
> OpenJDK Runtime Environment (build 17.0.10+7-Ubuntu-122.04.1)
> OpenJDK 64-Bit Server VM (build 17.0.10+7-Ubuntu-122.04.1, mixed mode, sharing)
> {noformat}
> Built with 11.
> The only configs I changed:
> {noformat}
> cluster_name: "system_views"
> num_tokens: 4
> seed_provider:
>   class_name: "org.apache.cassandra.locator.SimpleSeedProvider"
>   parameters:
>     seeds: "10.0.0.225"
> hints_directory: "/mnt/cassandra/hints"
> data_file_directories:
>   - "/mnt/cassandra/data"
> commitlog_directory: "/mnt/cassandra/commitlog"
> concurrent_reads: 64
> concurrent_writes: 64
> trickle_fsync: true
> endpoint_snitch: "Ec2Snitch"
> {noformat}
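The NoSuchMethodError points at the classpath rather than the config: {{LoaderOptions.setCodePointLimit(int)}} only exists in newer SnakeYAML releases (1.32+, to the best of my knowledge), so an older snakeyaml jar resolving first would reproduce the failure. A minimal check under that assumption:

{code:java}
import org.yaml.snakeyaml.LoaderOptions;

public class SnakeYamlCheck
{
    public static void main(String[] args)
    {
        LoaderOptions options = new LoaderOptions();
        // Throws java.lang.NoSuchMethodError when an older snakeyaml jar
        // (one without this method) is picked up from the classpath first.
        options.setCodePointLimit(Integer.MAX_VALUE);
        System.out.println("snakeyaml on the classpath supports setCodePointLimit");
    }
}
{code}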
[jira] [Updated] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce
[ https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19158:
------------------------------------
    Status: Ready to Commit  (was: Changes Suggested)

> Reuse native transport-driven futures in Debounce
> --------------------------------------------------
>
>                 Key: CASSANDRA-19158
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19158
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Transactional Cluster Metadata
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
>         Attachments: ci_summary-1.html, ci_summary-2.html, ci_summary.html
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently, we create a future in Debounce, then create one more future in
> RemoteProcessor#sendWithCallback. This is further exacerbated by chaining
> calls, when we first attempt to catch up from a peer, and then from the CMS.
> First of all, we should only use the native transport timeout-driven futures
> returned from sendWithCallback, since they implement reasonable retries
> under the hood and are easy to bulk-configure (i.e. you can simply change the
> timeout in yaml and have all futures change their behaviour).
> Second, we should _chain_ futures and use map or andThen for fallback
> operations, such as trying to catch up from the CMS after an unsuccessful
> attempt to catch up from a peer.
> This should significantly simplify the code and reduce the number of
> blocked/waiting threads.
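The chaining the description calls for looks roughly like this — plain CompletableFuture is used for brevity, Cassandra's own Future API with map/andThen is analogous, and the method names below are illustrative, not the patch:

{code:java}
import java.util.concurrent.CompletableFuture;

public final class CatchUpSketch
{
    static CompletableFuture<String> catchUpFromPeer() { return CompletableFuture.completedFuture("peer"); }
    static CompletableFuture<String> catchUpFromCms()  { return CompletableFuture.completedFuture("cms"); }

    // One chained pipeline: fall back to the CMS only if the peer attempt
    // fails, without creating and parking a thread on an intermediate future.
    public static CompletableFuture<String> catchUp()
    {
        return catchUpFromPeer().exceptionallyCompose(t -> catchUpFromCms());
    }
}
{code}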
[jira] [Updated] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce
[ https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19158:
------------------------------------
          Fix Version/s: 5.1-alpha1
    Source Control Link: https://github.com/apache/cassandra/commit/2e05cd4c8dd22e458eb1d2dad9cd34936b470266
             Resolution: Fixed
                 Status: Resolved  (was: Ready to Commit)

> Reuse native transport-driven futures in Debounce
> --------------------------------------------------
>
>                 Key: CASSANDRA-19158
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19158
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Transactional Cluster Metadata
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 5.1-alpha1
>
>         Attachments: ci_summary-1.html, ci_summary-2.html, ci_summary.html
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Petrov updated CASSANDRA-19534:
------------------------------------
    Attachment: ci_summary-4.1.html

> unbounded queues in native transport requests lead to node instability
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-19534
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Jon Haddad
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 4.1.x, 5.0-rc, 5.x
>
>         Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - QUEUE.jpg,
>                      Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg,
>                      Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary-4.1.html,
>                      ci_summary-5.0.html, ci_summary-trunk.html, ci_summary.html,
>                      image-2024-05-03-16-08-10-101.png, screenshot-1.png, screenshot-2.png,
>                      screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png,
>                      screenshot-7.png, screenshot-8.png, screenshot-9.png
>
>          Time Spent: 9h 50m
>  Remaining Estimate: 0h
[jira] [Comment Edited] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce
[ https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849920#comment-17849920 ]

Alex Petrov edited comment on CASSANDRA-19158 at 5/28/24 9:09 AM:
------------------------------------------------------------------

[~samt] I think I have addressed all your comments, and got a CI run with one
unrelated failure. Could you take another look?

was (Author: ifesdjeen):
[~samt] I think I have addressed all your comments, and got a clean CI now.
Could you take another look?

> Reuse native transport-driven futures in Debounce
> --------------------------------------------------
>
>                 Key: CASSANDRA-19158
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19158
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Transactional Cluster Metadata
>            Reporter: Alex Petrov
>            Assignee: Alex Petrov
>            Priority: Normal
>         Attachments: ci_summary-1.html, ci_summary-2.html, ci_summary.html
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
[jira] [Updated] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce
[ https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19158: Attachment: ci_summary-2.html > Reuse native transport-driven futures in Debounce > - > > Key: CASSANDRA-19158 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19158 > Project: Cassandra > Issue Type: Improvement > Components: Transactional Cluster Metadata >Reporter: Alex Petrov >Assignee: Alex Petrov >Priority: Normal > Attachments: ci_summary-1.html, ci_summary-2.html, ci_summary.html > > Time Spent: 1h > Remaining Estimate: 0h > > Currently, we create a future in Debounce, then create one more future in > RemoteProcessor#sendWithCallback. This is further exacerbated by chaining > calls, when we first attempt to catch up from a peer, and then from the CMS. > First of all, we should always use only the native transport timeout-driven > futures returned from sendWithCallback, since they implement reasonable > retries under the hood, and are easy to bulk-configure (i.e. you can simply > change the timeout in yaml and have all futures change their behaviour). > Second, we should _chain_ futures and use map or andThen for fallback > operations such as trying to catch up from the CMS after an unsuccessful attempt to > catch up from a peer. > This should significantly simplify the code and reduce the number of blocked/waiting > threads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce
[ https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849920#comment-17849920 ] Alex Petrov commented on CASSANDRA-19158: - [~samt] I think I have addressed all your comments, and got a clean CI now. Could you take another look? > Reuse native transport-driven futures in Debounce > - > > Key: CASSANDRA-19158 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19158 > Project: Cassandra > Issue Type: Improvement > Components: Transactional Cluster Metadata >Reporter: Alex Petrov >Assignee: Alex Petrov >Priority: Normal > Attachments: ci_summary-1.html, ci_summary-2.html, ci_summary.html > > Time Spent: 1h > Remaining Estimate: 0h > > Currently, we create a future in Debounce, then create one more future in > RemoteProcessor#sendWithCallback. This is further exacerbated by chaining > calls, when we first attempt to catch up from a peer, and then from the CMS. > First of all, we should always use only the native transport timeout-driven > futures returned from sendWithCallback, since they implement reasonable > retries under the hood, and are easy to bulk-configure (i.e. you can simply > change the timeout in yaml and have all futures change their behaviour). > Second, we should _chain_ futures and use map or andThen for fallback > operations such as trying to catch up from the CMS after an unsuccessful attempt to > catch up from a peer. > This should significantly simplify the code and reduce the number of blocked/waiting > threads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17849812#comment-17849812 ] Alex Petrov commented on CASSANDRA-19534: - [~e.dimitrova] I believe it does. I was just finishing up the trunk and 4.1 commits, and getting clean CI runs. I think it looks mostly good now. > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 4.1.x, 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary-5.0.html, > ci_summary-trunk.html, ci_summary.html, image-2024-05-03-16-08-10-101.png, > screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png, > screenshot-5.png, screenshot-6.png, screenshot-7.png, screenshot-8.png, > screenshot-9.png > > Time Spent: 9h 50m > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19534: Attachment: ci_summary-trunk.html > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 4.1.x, 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary-5.0.html, > ci_summary-trunk.html, ci_summary.html, image-2024-05-03-16-08-10-101.png, > screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png, > screenshot-5.png, screenshot-6.png, screenshot-7.png, screenshot-8.png, > screenshot-9.png > > Time Spent: 9h 50m > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19534: Attachment: ci_summary-5.0.html > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 4.1.x, 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary-5.0.html, > ci_summary.html, image-2024-05-03-16-08-10-101.png, screenshot-1.png, > screenshot-2.png, screenshot-3.png, screenshot-4.png, screenshot-5.png, > screenshot-6.png, screenshot-7.png, screenshot-8.png, screenshot-9.png > > Time Spent: 9h 50m > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19664) Accord Journal Determinism: PreAccept replay stability
Alex Petrov created CASSANDRA-19664: --- Summary: Accord Journal Determinism: PreAccept replay stability Key: CASSANDRA-19664 URL: https://issues.apache.org/jira/browse/CASSANDRA-19664 Project: Cassandra Issue Type: Bug Reporter: Alex Petrov Assignee: Alex Petrov Currently, some messages, such as PreAccept, can have some of their context initialized on replay. This patch adds a concept of Context to the Journal that can be used for arbitrary information necessary for replaying them just the way they were executed the first time. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
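To make the Context idea above concrete, here is one way such a replay context could look. This is a guess at the shape for illustration only, not the actual Accord Journal API: side information is captured at first execution, persisted alongside the journal record, and handed back on replay.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Illustrative only: per-message key/value context captured on first execution
// and returned on replay so the message executes exactly as it did originally.
final class ReplayContextSketch
{
    private final Map<String, Object> values = new HashMap<>();

    void record(String key, Object value) { values.put(key, value); }

    Object replay(String key) { return values.get(key); }
}
{code}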
[jira] [Commented] (CASSANDRA-19592) Expand CREATE TABLE CQL on a coordinating node before submitting to CMS
[ https://issues.apache.org/jira/browse/CASSANDRA-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17846953#comment-17846953 ] Alex Petrov commented on CASSANDRA-19592: - [~samt] looks good to me! > Expand CREATE TABLE CQL on a coordinating node before submitting to CMS > --- > > Key: CASSANDRA-19592 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19592 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Schema >Reporter: Alex Petrov >Assignee: Alex Petrov >Priority: Normal > Attachments: ci_summary-1.html, ci_summary.html > > > This is done to unblock CASSANDRA-12937 and allow preserving defaults with > which the table was created between node bounces and between nodes with > different configurations. For now, we are preserving 5.0 behaviour. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce
[ https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19158: Attachment: ci_summary.html > Reuse native transport-driven futures in Debounce > - > > Key: CASSANDRA-19158 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19158 > Project: Cassandra > Issue Type: Improvement > Components: Transactional Cluster Metadata >Reporter: Alex Petrov >Assignee: Alex Petrov >Priority: Normal > Attachments: ci_summary-1.html, ci_summary.html > > Time Spent: 1h > Remaining Estimate: 0h > > Currently, we create a future in Debounce, then create one more future in > RemoteProcessor#sendWithCallback. This is further exacerbated by chaining > calls, when we first attempt to catch up from a peer, and then from the CMS. > First of all, we should always use only the native transport timeout-driven > futures returned from sendWithCallback, since they implement reasonable > retries under the hood, and are easy to bulk-configure (i.e. you can simply > change the timeout in yaml and have all futures change their behaviour). > Second, we should _chain_ futures and use map or andThen for fallback > operations such as trying to catch up from the CMS after an unsuccessful attempt to > catch up from a peer. > This should significantly simplify the code and reduce the number of blocked/waiting > threads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce
[ https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19158: Attachment: (was: ci_summary.html) > Reuse native transport-driven futures in Debounce > - > > Key: CASSANDRA-19158 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19158 > Project: Cassandra > Issue Type: Improvement > Components: Transactional Cluster Metadata >Reporter: Alex Petrov >Assignee: Alex Petrov >Priority: Normal > Attachments: ci_summary-1.html > > Time Spent: 1h > Remaining Estimate: 0h > > Currently, we create a future in Debounce, then create one more future in > RemoteProcessor#sendWithCallback. This is further exacerbated by chaining > calls, when we first attempt to catch up from a peer, and then from the CMS. > First of all, we should always use only the native transport timeout-driven > futures returned from sendWithCallback, since they implement reasonable > retries under the hood, and are easy to bulk-configure (i.e. you can simply > change the timeout in yaml and have all futures change their behaviour). > Second, we should _chain_ futures and use map or andThen for fallback > operations such as trying to catch up from the CMS after an unsuccessful attempt to > catch up from a peer. > This should significantly simplify the code and reduce the number of blocked/waiting > threads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce
[ https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19158: Attachment: ci_summary-1.html > Reuse native transport-driven futures in Debounce > - > > Key: CASSANDRA-19158 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19158 > Project: Cassandra > Issue Type: Improvement > Components: Transactional Cluster Metadata >Reporter: Alex Petrov >Assignee: Alex Petrov >Priority: Normal > Attachments: ci_summary-1.html, ci_summary.html > > Time Spent: 1h > Remaining Estimate: 0h > > Currently, we create a future in Debounce, then create one more future in > RemoteProcessor#sendWithCallback. This is further exacerbated by chaining > calls, when we first attempt to catch up from a peer, and then from the CMS. > First of all, we should always use only the native transport timeout-driven > futures returned from sendWithCallback, since they implement reasonable > retries under the hood, and are easy to bulk-configure (i.e. you can simply > change the timeout in yaml and have all futures change their behaviour). > Second, we should _chain_ futures and use map or andThen for fallback > operations such as trying to catch up from the CMS after an unsuccessful attempt to > catch up from a peer. > This should significantly simplify the code and reduce the number of blocked/waiting > threads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-19134) Avoid flushing on every append in the LocalLog
[ https://issues.apache.org/jira/browse/CASSANDRA-19134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov reassigned CASSANDRA-19134: --- Assignee: Aleksey Yeschenko (was: Alex Petrov) > Avoid flushing on every append in the LocalLog > -- > > Key: CASSANDRA-19134 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19134 > Project: Cassandra > Issue Type: Improvement > Components: Cluster/Membership >Reporter: Marcus Eriksson >Assignee: Aleksey Yeschenko >Priority: Normal > Fix For: 5.1-alpha1 > > > Right now, we are performing flush on every transformation that is appended > to the local log. While this does make _some_ sense, it may not be what we > always want to do. We have initially added this flush as a way to remedy node > bounces following schema changes, but this should no longer be necessary. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19592) Expand CREATE TABLE CQL on a coordinating node before submitting to CMS
[ https://issues.apache.org/jira/browse/CASSANDRA-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845935#comment-17845935 ] Alex Petrov commented on CASSANDRA-19592: - Updated the patch with comments from Sam, Marcus, and Stefan > Expand CREATE TABLE CQL on a coordinating node before submitting to CMS > --- > > Key: CASSANDRA-19592 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19592 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Schema >Reporter: Alex Petrov >Assignee: Alex Petrov >Priority: Normal > Attachments: ci_summary.html > > > This is done to unblock CASSANDRA-12937 and allow preserving defaults with > which the table was created between node bounces and between nodes with > different configurations. For now, we are preserving 5.0 behaviour. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845930#comment-17845930 ] Alex Petrov commented on CASSANDRA-19534: - Pushed a new commit that should address your comments [~maedhroz] > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 4.1.x, 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html, > image-2024-05-03-16-08-10-101.png, screenshot-1.png, screenshot-2.png, > screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png, > screenshot-7.png, screenshot-8.png, screenshot-9.png > > Time Spent: 9h > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17845930#comment-17845930 ] Alex Petrov edited comment on CASSANDRA-19534 at 5/13/24 1:47 PM: -- [~maedhroz] thank you for the review! Pushed a new commit that should address your comments. was (Author: ifesdjeen): Pushed a new commit that should address your comments [~maedhroz] > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 4.1.x, 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html, > image-2024-05-03-16-08-10-101.png, screenshot-1.png, screenshot-2.png, > screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png, > screenshot-7.png, screenshot-8.png, screenshot-9.png > > Time Spent: 9h > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-17354) Bump java-driver dependency in Cassandra to latest 3.x series
[ https://issues.apache.org/jira/browse/CASSANDRA-17354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-17354: Resolution: Won't Fix Status: Resolved (was: Open) As per Abe's message > Bump java-driver dependency in Cassandra to latest 3.x series > -- > > Key: CASSANDRA-17354 > URL: https://issues.apache.org/jira/browse/CASSANDRA-17354 > Project: Cassandra > Issue Type: Task > Components: Test/unit >Reporter: Alex Petrov >Priority: High > Fix For: 5.x > > > We depend on java-driver for testing, and developing/validating native > protocol changes. Unfortunately, the version of the driver that is included with > Cassandra is quite ancient: 3.0.1. We need to bump this dependency to the latest > in the 3.x series, without upgrading to 4.0, at least for now. Unfortunately, this > is not a trivial change in build.xml (otherwise I would’ve done it rather > than opening this ticket), and bumping the version breaks a few tests in all > versions, so those need to be fixed, too. > This should be a prerequisite for the next minor version release, too. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-16135) Separate in-JVM test into smaller packages
[ https://issues.apache.org/jira/browse/CASSANDRA-16135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov reassigned CASSANDRA-16135: --- Assignee: (was: Alex Petrov) > Separate in-JVM test into smaller packages > -- > > Key: CASSANDRA-16135 > URL: https://issues.apache.org/jira/browse/CASSANDRA-16135 > Project: Cassandra > Issue Type: Task > Components: Test/dtest/java >Reporter: Alex Petrov >Priority: High > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.0.x > > Time Spent: 20m > Remaining Estimate: 0h > > Introduce a structure similar to how tags are organised in Cassandra Jira for > corresponding in-jvm dtests, to help people find the right place for their tests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time
[ https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov reassigned CASSANDRA-19215: --- Assignee: (was: Alex Petrov) > "Query start time" in native transport request threads should be the task > enqueue time > -- > > Key: CASSANDRA-19215 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19215 > Project: Cassandra > Issue Type: Bug > Components: Messaging/Client >Reporter: Runtian Liu >Priority: Normal > Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x > > Attachments: ci_summary.html, result_details.tar.gz > > > Recently, our Cassandra 4.0.6 cluster experienced an outage due to a surge in > expensive traffic from the application side. This surge involved a large > volume of costly read queries, which took a considerable amount of time to > process on the server side. The client had timeout settings; if a request > timed out, it might trigger the sending of new requests. Since the server > nodes were overloaded, numerous nodes had hundreds of thousands of tasks > queued in the Native-Transport-Request pending queue. I expected that once > the application ceased sending requests, the server node would quickly return > to normal, as most requests in the queue were over half an hour old and > should have timed out rapidly, clearing the queue. However, it actually took > an hour to clear the native transport's pending queue, even with native > transport disabled. Upon examining the code, I noticed that for read/write > requests, the > [queryStartNanoTime|https://github.com/apache/cassandra/blob/cassandra-4.0/src/java/org/apache/cassandra/transport/Dispatcher.java#L78], > which determines if a request has timed out, only begins when the task > starts processing. This means that no matter how long a request has been > pending, it doesn't contribute to the timeout. I believe this is incorrect. > The timer should start when the Cassandra server receives the request or when > it enqueues the task, not when the request/task begins processing. This way, > an overloaded node with many pending tasks can quickly discard timed-out > requests and recover from an outage once new requests stop. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
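The change the reporter argues for above boils down to one line of accounting: stamp a request when it is enqueued, not when a worker picks it up, and compare against that stamp when deciding whether the request has timed out. A minimal sketch, with names invented for illustration rather than taken from the Dispatcher code:

{code:java}
import java.util.concurrent.TimeUnit;

final class EnqueueDeadlineSketch
{
    static final long TIMEOUT_NANOS = TimeUnit.SECONDS.toNanos(10);

    static final class Request
    {
        // Stamped when the task is enqueued, not when processing starts.
        final long enqueuedAtNanos = System.nanoTime();

        boolean timedOut()
        {
            // Time spent waiting in the queue now counts against the timeout,
            // so a backlogged node can discard stale requests without doing the work.
            return System.nanoTime() - enqueuedAtNanos > TIMEOUT_NANOS;
        }
    }
}
{code}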
[jira] [Commented] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time
[ https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844258#comment-17844258 ] Alex Petrov commented on CASSANDRA-19215: - This is now largely superseded by work on [CASSANDRA-19534], as I have posted the patch there. > "Query start time" in native transport request threads should be the task > enqueue time > -- > > Key: CASSANDRA-19215 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19215 > Project: Cassandra > Issue Type: Bug > Components: Messaging/Client >Reporter: Runtian Liu >Assignee: Alex Petrov >Priority: Normal > Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x > > Attachments: ci_summary.html, result_details.tar.gz > > > Recently, our Cassandra 4.0.6 cluster experienced an outage due to a surge in > expensive traffic from the application side. This surge involved a large > volume of costly read queries, which took a considerable amount of time to > process on the server side. The client had timeout settings; if a request > timed out, it might trigger the sending of new requests. Since the server > nodes were overloaded, numerous nodes had hundreds of thousands of tasks > queued in the Native-Transport-Request pending queue. I expected that once > the application ceased sending requests, the server node would quickly return > to normal, as most requests in the queue were over half an hour old and > should have timed out rapidly, clearing the queue. However, it actually took > an hour to clear the native transport's pending queue, even with native > transport disabled. Upon examining the code, I noticed that for read/write > requests, the > [queryStartNanoTime|https://github.com/apache/cassandra/blob/cassandra-4.0/src/java/org/apache/cassandra/transport/Dispatcher.java#L78], > which determines if a request has timed out, only begins when the task > starts processing. This means that no matter how long a request has been > pending, it doesn't contribute to the timeout. I believe this is incorrect. > The timer should start when the Cassandra server receives the request or when > it enqueues the task, not when the request/task begins processing. This way, > an overloaded node with many pending tasks can quickly discard timed-out > requests and recover from an outage once new requests stop. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-13138) SASI tries to fetch an extra page when resultset size is same size as page size
[ https://issues.apache.org/jira/browse/CASSANDRA-13138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov reassigned CASSANDRA-13138: --- Assignee: (was: Alex Petrov) > SASI tries to fetch an extra page when resultset size is same size as page > size > --- > > Key: CASSANDRA-13138 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13138 > Project: Cassandra > Issue Type: Bug > Components: Feature/SASI >Reporter: Alex Petrov >Priority: Normal > > For example, in a dataset that would return 10 rows, SASI would try (and > return an empty page) to fetch the next page, while filtering and 2i > return results correctly: > {code} > pk | ck1 | ck2 | reg1 | reg2 | reg3 > +-+-+--+--+-- > 6 | 5 | 5 |5 |5 | 10 > 7 | 5 | 5 |5 |5 | 10 > 9 | 5 | 5 |5 |5 | 10 > 4 | 5 | 5 |5 |5 | 10 > 3 | 5 | 5 |5 |5 | 10 > 5 | 5 | 5 |5 |5 | 10 > 0 | 5 | 5 |5 |5 | 10 > 8 | 5 | 5 |5 |5 | 10 > 2 | 5 | 5 |5 |5 | 10 > 1 | 5 | 5 |5 |5 | 10 > ---MORE--- > (10 rows) > {code} > (that {{--MORE--}} shouldn't have been there) > This might be an inherent limitation, although even if it is, we can opt > for fetching limit+1 if the data limits aren't exhausted. Although it seems > that there should be a solution for it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
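The "fetching limit+1" idea at the end of the description above is the standard pagination trick: ask for one more row than the page size, and only report that another page exists if that extra row actually comes back. A generic sketch of the technique, not SASI code:

{code:java}
import java.util.List;
import java.util.function.IntFunction;

final class PagingSketch
{
    record Page<T>(List<T> rows, boolean hasMore) {}

    // fetch stands in for whatever actually reads rows; asking for pageSize + 1
    // means an exactly-full page is not mistaken for "another page exists".
    static <T> Page<T> readPage(IntFunction<List<T>> fetch, int pageSize)
    {
        List<T> rows = fetch.apply(pageSize + 1);
        boolean hasMore = rows.size() > pageSize;
        return new Page<>(hasMore ? rows.subList(0, pageSize) : rows, hasMore);
    }
}
{code}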
[jira] [Assigned] (CASSANDRA-15413) Missing results on reading large frozen text map
[ https://issues.apache.org/jira/browse/CASSANDRA-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov reassigned CASSANDRA-15413: --- Assignee: (was: Alex Petrov) > Missing results on reading large frozen text map > > > Key: CASSANDRA-15413 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15413 > Project: Cassandra > Issue Type: Bug > Components: Local/SSTable >Reporter: Tyler Codispoti >Priority: Normal > > Cassandra version: 2.2.15 > I have been running into a case where, when fetching the results from a table > with a frozen<map<text, text>>, if the number of results is greater than the > fetch size (default 5000), we can end up with missing data. > Side note: The table schema comes from using KairosDB, but we've isolated > this issue to Cassandra itself. But it looks like this can cause problems for > users of KairosDB as well. > Repro case. Tested against fresh install of Cassandra 2.2.15. > 1. Create table (cqlsh) > {code:sql} > CREATE KEYSPACE test > WITH REPLICATION = { >'class' : 'SimpleStrategy', >'replication_factor' : 1 > }; > CREATE TABLE test.test ( > name text, > tags frozen<map<text, text>>, > PRIMARY KEY (name, tags) > ) WITH CLUSTERING ORDER BY (tags ASC); > {code} > 2. Insert data (python3) > {code:python} > import time > from cassandra.cluster import Cluster > cluster = Cluster(['127.0.0.1']) > session = cluster.connect('test') > for i in range(0, 20000): > session.execute( > """ > INSERT INTO test (name, tags) > VALUES (%s, %s) > """, > ("test_name", {'id':str(i)}) > ) > {code} > > 3. Flush > > {code} > nodetool flush{code} > > > 4. Fetch data (python3) > {code:python} > import time > from cassandra.cluster import Cluster > cluster = Cluster(['127.0.0.1'], control_connection_timeout=5000) > session = cluster.connect('test') > session.default_fetch_size = 5000 > session.default_timeout = 120 > count = 0 > rows = session.execute("select tags from test where name='test_name'") > for row in rows: > count += 1 > print(count) > {code} > Result: 10111 (expected 20000) > > Changing the page size changes the result count. Some quick samples: > > ||default_fetch_size||count|| > |5000|10111| > |1000|1830| > |999|1840| > |998|1850| > |20000|20000| > |100000|20000| > > > In short, I cannot guarantee I'll get all the results back unless the page > size > number of rows. > This seems to get worse with multiple SSTables (eg nodetool flush between > some of the insert batches). When using replication, the issue can get > disgustingly bad - potentially giving a different result on each query. > Interestingly, if we pad the values on the tag map ("id" in this repro case) so > that the insertion is in lexicographical order, there is no issue. I believe > the issue also does not repro if I do not call "nodetool flush" before > querying. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-13478) SASI Sparse mode overflow corrupts the SSTable
[ https://issues.apache.org/jira/browse/CASSANDRA-13478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov reassigned CASSANDRA-13478: --- Assignee: (was: Alex Petrov) > SASI Sparse mode overflow corrupts the SSTable > -- > > Key: CASSANDRA-13478 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13478 > Project: Cassandra > Issue Type: Bug > Components: Feature/SASI > Environment: cqlsh 5.0.1 | Cassandra 3.10 | CQL spec 3.4.4 | Native > protocol v4 | ubuntu 14.04 >Reporter: jack chen >Priority: Low > Attachments: schema > > > I have a table; the schema can be seen in the attached file. > I would like to search the data using the timestamp data type with lt/gt/eq (<, >, =) > comparisons as a query condition. > Ex: > {code} > CREATE TABLE XXX.userlist ( > userid text PRIMARY KEY, > lastposttime timestamp > ); > SELECT * FROM userlist WHERE lastposttime > '2017-04-01 16:00:00+'; > {code} > There are 2 scenarios: > If I insert the data and then select it, the result will be correct. > But if I insert data, restart Cassandra the next day, and > select the data after that, no data will be selected. > The difference is that there is no service restart on the next day in the > first scenario. Actually, the data is still living in Cassandra, but the timestamp > can’t be used as the query condition. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-13243) testall failure in org.apache.cassandra.index.sasi.SASIIndexTest.testMultiExpressionQueriesWhereRowSplitBetweenSSTables-compression
[ https://issues.apache.org/jira/browse/CASSANDRA-13243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov reassigned CASSANDRA-13243: --- Assignee: (was: Alex Petrov) > testall failure in > org.apache.cassandra.index.sasi.SASIIndexTest.testMultiExpressionQueriesWhereRowSplitBetweenSSTables-compression > --- > > Key: CASSANDRA-13243 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13243 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Testing >Reporter: Sean McCarthy >Priority: Normal > Labels: test-failure, testall > Attachments: TEST-org.apache.cassandra.index.sasi.SASIIndexTest.log > > > example failure: > http://cassci.datastax.com/job/trunk_testall/1412/testReport/org.apache.cassandra.index.sasi/SASIIndexTest/testMultiExpressionQueriesWhereRowSplitBetweenSSTables_compression > {code} > Error Message > [key0, key11, key12, key13, key14, key6, key7, key8] expected:<10> but was:<8> > {code}{code} > Stacktrace > junit.framework.AssertionFailedError: [key0, key11, key12, key13, key14, > key6, key7, key8] expected:<10> but was:<8> > at > org.apache.cassandra.index.sasi.SASIIndexTest.testMultiExpressionQueriesWhereRowSplitBetweenSSTables(SASIIndexTest.java:567) > at > org.apache.cassandra.index.sasi.SASIIndexTest.testMultiExpressionQueriesWhereRowSplitBetweenSSTables(SASIIndexTest.java:452) > {code}{code} > Standard Output > ERROR [main] 2017-02-17 23:02:40,404 ?:? - SLF4J: stderr > INFO [main] 2017-02-17 23:02:40,830 ?:? - Configuration location: > file:/home/automaton/cassandra/test/conf/cassandra-murmur.yaml > DEBUG [main] 2017-02-17 23:02:40,831 ?:? - Loading settings from > file:/home/automaton/cassandra/test/conf/cassandra-murmur.yaml > INFO [main] 2017-02-17 23:02:41,678 ?:? - Node > configuration:[allocate_tokens_for_keyspace=null; authenticator=null; > authorizer=null; auto_bootstrap=true; auto_snapshot=true; back_pres > ...[truncated 416882 chars]... > .957KiB), biggest 4.957KiB, smallest 4.957KiB > DEBUG [CompactionExecutor:3] 2017-02-17 23:03:16,787 ?:? - Compacted > (cb40-f565-11e6-8e91-7511b7f59d65) 4 sstables to > [/home/automaton/cassandra/build/test/cassandra/data:231/system/local-7ad54392bcdd35a684174e047860b377/md-85-big,] > to level=0. 0.466KiB to 0.258KiB (~55% of original) in 58ms. Read > Throughput = 7.914KiB/s, Write Throughput = 4.380KiB/s, Row Throughput = > ~2/s. 4 total partitions merged to 1. Partition merge counts were {4:1, } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843389#comment-17843389 ] Alex Petrov commented on CASSANDRA-19534: - These tests look really good! I hadn't expected the one-patched-node scenario to work that well, but I'm glad that it helps even in that case. Thank you for checking! > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 4.1.x, 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html, > image-2024-05-03-16-08-10-101.png, screenshot-1.png, screenshot-2.png, > screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png, > screenshot-7.png, screenshot-8.png, screenshot-9.png > > Time Spent: 20m > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17843375#comment-17843375 ] Alex Petrov commented on CASSANDRA-19534: - Thank you for looking into this. Just to make sure: the patch works on both the coordinator and the replica side, so it would make the most sense to compare two clusters, one with the patch and one without. There might be some improvement if we only have one node using deadlines, but then all three nodes will benefit from replica-side shedding, while coordinator-side shedding will work for just one of them. I think having the patch on all nodes will have a more pronounced effect. > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 4.1.x, 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html, > screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png > > Time Spent: 20m > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842171#comment-17842171 ] Alex Petrov commented on CASSANDRA-19534: - This is great, thank you for testing! My 100s timeout was erring (probably too far) on the side of sticking to the old behaviour. I was slightly concerned that people would see timeouts and conclude this is not something they want. But unfortunately there’s no way for us to produce a reasonable workload balance without shedding some load and timing out lagging requests. I will update the default to 12s. > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html > > Time Spent: 10m > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842112#comment-17842112 ] Alex Petrov commented on CASSANDRA-19534: - [~brandon.williams] [~rustyrazorblade] would you be so kind to try running your tests? I suggest setting {{native_transport_timeout_in_ms}} to about 10 (or 12 max) seconds, and {{internode_timeout}} to {{true}} for starters. If you really want to push the limits, I'd suggest setting {{cql_start_time}} to {{REQUEST}}, but this is optional, as we will not roll it out with this setting enabled. > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html > > Time Spent: 10m > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842112#comment-17842112 ] Alex Petrov edited comment on CASSANDRA-19534 at 4/29/24 5:24 PM: -- [~brandon.williams] [~rustyrazorblade] would you be so kind as to try running your tests against the branch posted above? I suggest setting {{native_transport_timeout_in_ms}} to about 10 (or 12 max) seconds, and {{internode_timeout}} to {{true}} for starters. If you really want to push the limits, I'd suggest setting {{cql_start_time}} to {{REQUEST}}, but this is optional, as we will not roll it out with this setting enabled. was (Author: ifesdjeen): [~brandon.williams] [~rustyrazorblade] would you be so kind to try running your tests? I suggest setting {{native_transport_timeout_in_ms}} to about 10 (or 12 max) seconds, and {{internode_timeout}} to {{true}} for starters. If you really want to push the limits, I'd suggest setting {{cql_start_time}} to {{REQUEST}}, but this is optional, as we will not roll it out with this setting enabled. > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html > > Time Spent: 10m > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19534: Test and Documentation Plan: Includes tests, also was tested separately; screenshots and description attached Status: Patch Available (was: Open) > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html > > Time Spent: 10m > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19534: Attachment: ci_summary.html > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html > > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce
[ https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19158: Attachment: ci_summary.html > Reuse native transport-driven futures in Debounce > - > > Key: CASSANDRA-19158 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19158 > Project: Cassandra > Issue Type: Improvement >Reporter: Alex Petrov >Assignee: Alex Petrov >Priority: Normal > Attachments: ci_summary.html > > > Currently, we create a future in Debounce, then create one more future in > RemoteProcessor#sendWithCallback. This is further exacerbated by chaining > calls, when we first attempt to catch up from a peer, and then from the CMS. > First of all, we should always use only the native transport timeout-driven > futures returned from sendWithCallback, since they implement reasonable > retries under the hood, and are easy to bulk-configure (i.e. you can simply > change the timeout in yaml and have all futures change their behaviour). > Second, we should _chain_ futures and use map or andThen for fallback > operations such as trying to catch up from the CMS after an unsuccessful attempt to > catch up from a peer. > This should significantly simplify the code and reduce the number of blocked/waiting > threads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
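As an illustration of the chaining the ticket calls for, here is a minimal sketch using {{CompletableFuture}} as a stand-in for Cassandra's internal future type (the ticket mentions map/andThen on it). The method names and the epoch payload are hypothetical, and the real code would return the single native-transport-driven future from sendWithCallback rather than wrapping it:
{code}
import java.util.concurrent.CompletableFuture;

// Toy model of the proposed fallback chain: try to catch up from a peer,
// and only on failure fall back to catching up from the CMS, without
// creating a second wrapping future or blocking a thread per attempt.
class CatchUpSketch
{
    static CompletableFuture<Long> catchUpFromPeer()
    {
        // Stand-in for the future returned by sendWithCallback (peer path)
        return CompletableFuture.failedFuture(new RuntimeException("peer unavailable"));
    }

    static CompletableFuture<Long> catchUpFromCms()
    {
        // Stand-in for the future returned by sendWithCallback (CMS path)
        return CompletableFuture.completedFuture(42L); // pretend epoch
    }

    public static void main(String[] args)
    {
        CompletableFuture<Long> epoch =
            catchUpFromPeer().exceptionallyCompose(failure -> catchUpFromCms());
        System.out.println("caught up to epoch " + epoch.join()); // prints 42
    }
}
{code}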
[jira] [Commented] (CASSANDRA-19592) Expand CREATE TABLE CQL on a coordinating node before submitting to CMS
[ https://issues.apache.org/jira/browse/CASSANDRA-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841204#comment-17841204 ] Alex Petrov commented on CASSANDRA-19592: - The compact-storage-related test is now fixed in the pushed version. > Expand CREATE TABLE CQL on a coordinating node before submitting to CMS > --- > > Key: CASSANDRA-19592 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19592 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Schema >Reporter: Alex Petrov >Assignee: Alex Petrov >Priority: Normal > Attachments: ci_summary.html > > > This is done to unblock CASSANDRA-12937 and allow preserving defaults with > which the table was created between node bounces and between nodes with > different configurations. For now, we are preserving 5.0 behaviour. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19592) Expand CREATE TABLE CQL on a coordinating node before submitting to CMS
[ https://issues.apache.org/jira/browse/CASSANDRA-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19592: Attachment: ci_summary.html > Expand CREATE TABLE CQL on a coordinating node before submitting to CMS > --- > > Key: CASSANDRA-19592 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19592 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Schema >Reporter: Alex Petrov >Assignee: Alex Petrov >Priority: Normal > Attachments: ci_summary.html > > > This is done to unblock CASSANDRA-12937 and allow preserving defaults with > which the table was created between node bounces and between nodes with > different configurations. For now, we are preserving 5.0 behaviour. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19221) CMS: Nodes can restart with new ipaddress already defined in the cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19221: Since Version: 5.0-alpha1 Source Control Link: https://github.com/apache/cassandra/commit/38512a469cef06770384423d0b30e3e85b511258 Resolution: Fixed Status: Resolved (was: Ready to Commit) > CMS: Nodes can restart with new ipaddress already defined in the cluster > > > Key: CASSANDRA-19221 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19221 > Project: Cassandra > Issue Type: Bug > Components: Transactional Cluster Metadata >Reporter: Paul Chandler >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.1-alpha1 > > Attachments: ci_summary-1.html, ci_summary.html > > > I am simulating running a cluster in Kubernetes and testing what happens when > several pods go down and ip addresses are swapped between nodes. In 4.0 this > is blocked and the node cannot be restarted. > To simulate this I create a 3 node cluster on a local machine using 3 > loopback addresses > {code} > 127.0.0.1 > 127.0.0.2 > 127.0.0.3 > {code} > The nodes are created correctly and the first node is assigned as a CMS node > as shown: > {code} > bin/nodetool -p 7199 describecms > {code} > Cluster Metadata Service: > {code} > Members: /127.0.0.1:7000 > Is Member: true > Service State: LOCAL > {code} > At this point I bring down the nodes 127.0.0.2 and 127.0.0.3 and swap the ip > addresses for the rpc_address and listen_address > > The nodes come back as normal, but the nodeid has now been swapped against > the ip address: > Before: > {code} > Datacenter: datacenter1 > === > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- Address Load Tokens Owns (effective) Host ID > Rack > UN 127.0.0.3 75.2 KiB 16 76.0% > 6d194555-f6eb-41d0-c000-0003 rack1 > UN 127.0.0.2 86.77 KiB 16 59.3% > 6d194555-f6eb-41d0-c000-0002 rack1 > UN 127.0.0.1 80.88 KiB 16 64.7% > 6d194555-f6eb-41d0-c000-0001 rack1 > {code} > After: > {code} > Datacenter: datacenter1 > === > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- Address Load Tokens Owns (effective) Host ID > Rack > UN 127.0.0.3 149.62 KiB 16 76.0% > 6d194555-f6eb-41d0-c000-0003 rack1 > UN 127.0.0.2 155.48 KiB 16 59.3% > 6d194555-f6eb-41d0-c000-0002 rack1 > UN 127.0.0.1 75.74 KiB 16 64.7% > 6d194555-f6eb-41d0-c000-0001 rack1 > {code} > On previous tests of this I have created a table with a replication factor of > 1, inserted some data before the swap. After the swap the data on nodes 2 > and 3 is now missing. > One theory I have is that I am using different port numbers for the different > nodes, and I am only swapping the ip addresses and not the port numbers, so > the ip:port still looks unique > i.e. 127.0.0.2:9043 becomes 127.0.0.2:9044 > and 127.0.0.3:9044 becomes 127.0.0.3:9043 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19592) Expand CREATE TABLE CQL on a coordinating node before submitting to CMS
[ https://issues.apache.org/jira/browse/CASSANDRA-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19592: Test and Documentation Plan: Tests included Status: Patch Available (was: Open) > Expand CREATE TABLE CQL on a coordinating node before submitting to CMS > --- > > Key: CASSANDRA-19592 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19592 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Schema >Reporter: Alex Petrov >Assignee: Alex Petrov >Priority: Normal > > This is done to unblock CASSANDRA-12937 and allow preserving defaults with > which the table was created between node bounces and between nodes with > different configurations. For now, we are preserving 5.0 behaviour. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19592) Expand CREATE TABLE CQL on a coordinating node before submitting to CMS
[ https://issues.apache.org/jira/browse/CASSANDRA-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19592: Bug Category: Parent values: Correctness(12982)Level 1 values: API / Semantic Implementation(12988) Complexity: Normal Component/s: Cluster/Schema Discovered By: Code Inspection Severity: Normal Status: Open (was: Triage Needed) > Expand CREATE TABLE CQL on a coordinating node before submitting to CMS > --- > > Key: CASSANDRA-19592 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19592 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Schema >Reporter: Alex Petrov >Assignee: Alex Petrov >Priority: Normal > > This is done to unblock CASSANDRA-12937 and allow preserving defaults with > which the table was created between node bounces and between nodes with > different configurations. For now, we are preserving 5.0 behaviour. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-19592) Expand CREATE TABLE CQL on a coordinating node before submitting to CMS
Alex Petrov created CASSANDRA-19592: --- Summary: Expand CREATE TABLE CQL on a coordinating node before submitting to CMS Key: CASSANDRA-19592 URL: https://issues.apache.org/jira/browse/CASSANDRA-19592 Project: Cassandra Issue Type: Bug Reporter: Alex Petrov Assignee: Alex Petrov This is done to unblock CASSANDRA-12937 and allow preserving defaults with which the table was created between node bounces and between nodes with different configurations. For now, we are preserving 5.0 behaviour. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
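To make the expansion concrete, a hypothetical before/after (keyspace, table, and option values are invented for the example; the real defaults come from the coordinator's configuration):
{code}
-- What the client submits:
CREATE TABLE ks.t (k int PRIMARY KEY, v text);

-- What the coordinator would submit to the CMS after expansion, with its
-- own defaults spelled out so every node replays the same schema:
CREATE TABLE ks.t (k int PRIMARY KEY, v text)
    WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': '16'}
    AND compaction = {'class': 'SizeTieredCompactionStrategy'};
{code}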
[jira] [Commented] (CASSANDRA-12937) Default setting (yaml) for SSTable compression
[ https://issues.apache.org/jira/browse/CASSANDRA-12937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841183#comment-17841183 ] Alex Petrov commented on CASSANDRA-12937: - Yes, tried this locally, wrote a bunch of tests, patch coming up as soon as python dtests wrap up! And yes, it seemed like we should just replicate what 5.0 does right now, and it implicitly does this via schema mutations. Schema mutations are created using the coordinating node's defaults. Since we are not using schema mutations in 5.1 anymore, I thought expanding the CQL is the second-best option. > Default setting (yaml) for SSTable compression > -- > > Key: CASSANDRA-12937 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12937 > Project: Cassandra > Issue Type: Improvement > Components: Local/Config >Reporter: Michael Semb Wever >Assignee: Stefan Miklosovic >Priority: Low > Labels: AdventCalendar2021 > Fix For: 5.x > > Time Spent: 8h > Remaining Estimate: 0h > > In many situations the choice of compression for sstables is more relevant to > the disks attached than to the schema and data. > This issue is to add to cassandra.yaml a default value for sstable > compression that new tables will inherit (instead of the defaults found in > {{CompressionParams.DEFAULT}}). > Examples where this can be relevant are filesystems that do on-the-fly > compression (btrfs, zfs) or specific disk configurations or even specific C* > versions (see CASSANDRA-10995). > +Additional information for newcomers+ > Some new fields need to be added to {{cassandra.yaml}} to allow specifying > the field required for defining the default compression parameters. In > {{DatabaseDescriptor}} a new {{CompressionParams}} field should be added for > the default compression. This field should be initialized in > {{DatabaseDescriptor.applySimpleConfig()}}. At the different places where > {{CompressionParams.DEFAULT}} was used the code should call > {{DatabaseDescriptor#getDefaultCompressionParams}}, which should return a > copy of the configured {{CompressionParams}}. > Some unit test using {{OverrideConfigurationLoader}} should be used to test > that the table schema uses the new default when a new table is created (see > CreateTest for some example). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
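In miniature, the flow sketched in the ticket's newcomer notes (a yaml-configured default handed out where {{CompressionParams.DEFAULT}} used to be hard-coded) could look like the following. This is a self-contained toy with invented names, not the real Cassandra classes:
{code}
import java.util.Map;

// Toy model: parse a default once at startup, then let table creation ask
// for the configured default instead of a hard-coded constant.
final class DefaultCompressionSketch
{
    // Stand-in for the real CompressionParams class
    record CompressionParams(String klass, int chunkLengthKiB)
    {
        static final CompressionParams DEFAULT = new CompressionParams("LZ4Compressor", 16);
    }

    private static volatile CompressionParams configuredDefault = CompressionParams.DEFAULT;

    // Would run once from config application (the ticket points at applySimpleConfig()).
    static void applyConfig(Map<String, String> yamlOptions)
    {
        if (yamlOptions != null && yamlOptions.containsKey("class"))
            configuredDefault = new CompressionParams(
                yamlOptions.get("class"),
                Integer.parseInt(yamlOptions.getOrDefault("chunk_length_in_kb", "16")));
    }

    // Replaces direct uses of CompressionParams.DEFAULT at table creation time.
    static CompressionParams getDefaultCompressionParams()
    {
        return configuredDefault;
    }

    public static void main(String[] args)
    {
        applyConfig(Map.of("class", "ZstdCompressor", "chunk_length_in_kb", "64"));
        System.out.println(getDefaultCompressionParams()); // new tables would inherit this
    }
}
{code}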
[jira] [Commented] (CASSANDRA-12937) Default setting (yaml) for SSTable compression
[ https://issues.apache.org/jira/browse/CASSANDRA-12937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841157#comment-17841157 ] Alex Petrov commented on CASSANDRA-12937: - Looks like it is possible to solve this problem for now in a much simpler way. We can simply fully expand the {{CREATE TABLE}} on the coordinator and achieve persistence of arguments. I think we will need a CEP for a more sophisticated approach, which we should probably leave for later. > Default setting (yaml) for SSTable compression > -- > > Key: CASSANDRA-12937 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12937 > Project: Cassandra > Issue Type: Improvement > Components: Local/Config >Reporter: Michael Semb Wever >Assignee: Stefan Miklosovic >Priority: Low > Labels: AdventCalendar2021 > Fix For: 5.x > > Time Spent: 8h > Remaining Estimate: 0h > > In many situations the choice of compression for sstables is more relevant to > the disks attached than to the schema and data. > This issue is to add to cassandra.yaml a default value for sstable > compression that new tables will inherit (instead of the defaults found in > {{CompressionParams.DEFAULT}}). > Examples where this can be relevant are filesystems that do on-the-fly > compression (btrfs, zfs) or specific disk configurations or even specific C* > versions (see CASSANDRA-10995). > +Additional information for newcomers+ > Some new fields need to be added to {{cassandra.yaml}} to allow specifying > the field required for defining the default compression parameters. In > {{DatabaseDescriptor}} a new {{CompressionParams}} field should be added for > the default compression. This field should be initialized in > {{DatabaseDescriptor.applySimpleConfig()}}. At the different places where > {{CompressionParams.DEFAULT}} was used the code should call > {{DatabaseDescriptor#getDefaultCompressionParams}}, which should return a > copy of the configured {{CompressionParams}}. > Some unit test using {{OverrideConfigurationLoader}} should be used to test > that the table schema uses the new default when a new table is created (see > CreateTest for some example). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840173#comment-17840173 ] Alex Petrov edited comment on CASSANDRA-19534 at 4/24/24 7:17 AM: -- Sorry for the lack of clarity; before this patch, there was no deadline at all. Tasks would live in the system essentially forever, clogging queues doing busy work. I was intending to post a patch but it is currently in my CI queue; however, it is otherwise ready to go. I believe with a 12-second default, users will only see an improvement and there will be no learning curve at all. All configuration options are for the people who understand their request lifetimes and want to get an even better profile. was (Author: ifesdjeen): Sorry for the lack of clarity; today there’s no deadline at all. Tasks will live in the system essentially forever clogging queues doing busy work. I was intending to post a patch but it is currently in my CI queue; however otherwise ready to go. i believe with 12 seconds default, users will only see an improvement and there will be no learning curve at all. All configurable are for the people who understand their request lifetimes and want to get an even better profile. > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg > > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued.
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19221) CMS: Nodes can restart with new ipaddress already defined in the cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840204#comment-17840204 ] Alex Petrov commented on CASSANDRA-19221: - Addressed your comments [~samt], both failures are timeouts that are unrelated to the patch. I believe we should split the {{MetadataChangeSimulationTest}} since after adding transient tests it seems to sometimes cross the timeout deadline. > CMS: Nodes can restart with new ipaddress already defined in the cluster > > > Key: CASSANDRA-19221 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19221 > Project: Cassandra > Issue Type: Bug > Components: Transactional Cluster Metadata >Reporter: Paul Chandler >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.1-alpha1 > > Attachments: ci_summary-1.html, ci_summary.html > > > I am simulating running a cluster in Kubernetes and testing what happens when > several pods go down and ip addresses are swapped between nodes. In 4.0 this > is blocked and the node cannot be restarted. > To simulate this I create a 3 node cluster on a local machine using 3 > loopback addresses > {code} > 127.0.0.1 > 127.0.0.2 > 127.0.0.3 > {code} > The nodes are created correctly and the first node is assigned as a CMS node > as shown: > {code} > bin/nodetool -p 7199 describecms > {code} > Cluster Metadata Service: > {code} > Members: /127.0.0.1:7000 > Is Member: true > Service State: LOCAL > {code} > At this point I bring down the nodes 127.0.0.2 and 127.0.0.3 and swap the ip > addresses for the rpc_address and listen_address > > The nodes come back as normal, but the nodeid has now been swapped against > the ip address: > Before: > {code} > Datacenter: datacenter1 > === > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- Address Load Tokens Owns (effective) Host ID > Rack > UN 127.0.0.3 75.2 KiB 16 76.0% > 6d194555-f6eb-41d0-c000-0003 rack1 > UN 127.0.0.2 86.77 KiB 16 59.3% > 6d194555-f6eb-41d0-c000-0002 rack1 > UN 127.0.0.1 80.88 KiB 16 64.7% > 6d194555-f6eb-41d0-c000-0001 rack1 > {code} > After: > {code} > Datacenter: datacenter1 > === > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- Address Load Tokens Owns (effective) Host ID > Rack > UN 127.0.0.3 149.62 KiB 16 76.0% > 6d194555-f6eb-41d0-c000-0003 rack1 > UN 127.0.0.2 155.48 KiB 16 59.3% > 6d194555-f6eb-41d0-c000-0002 rack1 > UN 127.0.0.1 75.74 KiB 16 64.7% > 6d194555-f6eb-41d0-c000-0001 rack1 > {code} > On previous tests of this I have created a table with a replication factor of > 1, inserted some data before the swap. After the swap the data on nodes 2 > and 3 is now missing. > One theory I have is that I am using different port numbers for the different > nodes, and I am only swapping the ip addresses and not the port numbers, so > the ip:port still looks unique > i.e. 127.0.0.2:9043 becomes 127.0.0.2:9044 > and 127.0.0.3:9044 becomes 127.0.0.3:9043 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19221) CMS: Nodes can restart with new ipaddress already defined in the cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19221: Attachment: ci_summary-1.html > CMS: Nodes can restart with new ipaddress already defined in the cluster > > > Key: CASSANDRA-19221 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19221 > Project: Cassandra > Issue Type: Bug > Components: Transactional Cluster Metadata >Reporter: Paul Chandler >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.1-alpha1 > > Attachments: ci_summary-1.html, ci_summary.html > > > I am simulating running a cluster in Kubernetes and testing what happens when > several pods go down and ip addresses are swapped between nodes. In 4.0 this > is blocked and the node cannot be restarted. > To simulate this I create a 3 node cluster on a local machine using 3 > loopback addresses > {code} > 127.0.0.1 > 127.0.0.2 > 127.0.0.3 > {code} > The nodes are created correctly and the first node is assigned as a CMS node > as shown: > {code} > bin/nodetool -p 7199 describecms > {code} > Cluster Metadata Service: > {code} > Members: /127.0.0.1:7000 > Is Member: true > Service State: LOCAL > {code} > At this point I bring down the nodes 127.0.0.2 and 127.0.0.3 and swap the ip > addresses for the rpc_address and listen_address > > The nodes come back as normal, but the nodeid has now been swapped against > the ip address: > Before: > {code} > Datacenter: datacenter1 > === > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- Address Load Tokens Owns (effective) Host ID > Rack > UN 127.0.0.3 75.2 KiB 16 76.0% > 6d194555-f6eb-41d0-c000-0003 rack1 > UN 127.0.0.2 86.77 KiB 16 59.3% > 6d194555-f6eb-41d0-c000-0002 rack1 > UN 127.0.0.1 80.88 KiB 16 64.7% > 6d194555-f6eb-41d0-c000-0001 rack1 > {code} > After: > {code} > Datacenter: datacenter1 > === > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- Address Load Tokens Owns (effective) Host ID > Rack > UN 127.0.0.3 149.62 KiB 16 76.0% > 6d194555-f6eb-41d0-c000-0003 rack1 > UN 127.0.0.2 155.48 KiB 16 59.3% > 6d194555-f6eb-41d0-c000-0002 rack1 > UN 127.0.0.1 75.74 KiB 16 64.7% > 6d194555-f6eb-41d0-c000-0001 rack1 > {code} > On previous tests of this I have created a table with a replication factor of > 1, inserted some data before the swap. After the swap the data on nodes 2 > and 3 is now missing. > One theory I have is that I am using different port numbers for the different > nodes, and I am only swapping the ip addresses and not the port numbers, so > the ip:port still looks unique > i.e. 127.0.0.2:9043 becomes 127.0.0.2:9044 > and 127.0.0.3:9044 becomes 127.0.0.3:9043 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840173#comment-17840173 ] Alex Petrov commented on CASSANDRA-19534: - Sorry for the lack of clarity; today there’s no deadline at all. Tasks will live in the system essentially forever clogging queues doing busy work. I was intending to post a patch but it is currently in my CI queue; however otherwise ready to go. i believe with 12 seconds default, users will only see an improvement and there will be no learning curve at all. All configurable are for the people who understand their request lifetimes and want to get an even better profile. > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg > > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19534: Attachment: Scenario 2 - QUEUE + Backpressure.jpg Scenario 2 - QUEUE.jpg Scenario 2 - Stock.jpg > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg > > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19534: Attachment: Scenario 1 - QUEUE.jpg Scenario 1 - QUEUE + Backpressure.jpg Scenario 1 - Stock.jpg > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg > > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840058#comment-17840058 ] Alex Petrov commented on CASSANDRA-19534: - The main change is the introduction of a (currently implicit) configurable _native request deadline_. No request, read or write, will be allowed to prolong its execution beyond this deadline. Some of the hidden places that would allow requests to stay overdue were local executor runnables, replica-side writes, and hints. The default is 12 seconds, since this is how much time the 3.x driver (which I believe is still the most used version in the community) waits before removing its handlers, after which any response from the server will just be ignored. Now, there is an _option_ to enable expiration based on the queue time, which will be _disabled_ by default to preserve existing semantics, but my tests have shown enabling it only has positive effects. We will try it out cautiously in different clusters over the next months and will see if tests match up with real loads before we change any of the defaults. So by default, behaviour will be as follows: # If a request has spent more than 12 seconds in the NATIVE queue, we throw an Overloaded exception back to the client. This timeout used to be the max of the read/write/range/counter rpc timeouts. # If a request has spent less than 12 seconds, it is allowed to execute; any request issued by the coordinator can live: ## _either_ {{Verb.timeout}} number of milliseconds, ## _or_ up to the native request deadline, as measured from the time when the request was admitted to the coordinator's NATIVE queue, whichever of these happens earlier. Example 1, read timeout is 5 seconds: # Client sends a request; the request spends 6 seconds in the NATIVE queue # Coordinator issues requests to replicas; two replicas respond within 3 seconds # Coordinator responds to the client with success Example 2, read timeout is 5 seconds: # Client sends a request; the request spends 6 seconds in the NATIVE queue # Coordinator issues requests to replicas; one replica responds within 3 seconds; other replicas fail to respond within the 5-second read timeout # Coordinator responds to the client with a read timeout (preserves current behaviour) Example 3, read timeout is 5 seconds: # Client sends a request; the request spends 10 seconds in the NATIVE queue # Coordinator issues requests to replicas; all replicas fail to respond within 2 seconds # Coordinator responds to the client with a read timeout; if messages are still queued on the replicas, they will get dropped before processing There will be a _new_ metric that shows how many of the timeouts would have been “blind timeouts” previously. I.e. the client _would_ register them as timeouts, but we as server-side operators would be oblivious to them. This metric will keep us collectively motivated even if we see a slight uptick in timeouts after committing the patch. Lastly, there is an option to say how much of the 12 seconds client requests are allowed to spend in the native queue. You can say that if a client request has spent 80% of its max 12 seconds in the native queue, we start applying backpressure to the client socket (or throwing an Overloaded exception, depending on the value of {{native_transport_throw_on_overload}}).
We have to be careful with enabling this one, since my tests have shown that while we see fewer timeouts server-side, clients see more timeouts, because part of the time they consider “request time” is now spent somewhere in TCP queues, which we cannot account for.
h3. New Configuration Params
h3. cql_start_time
Configures what is considered the base for the replica-side timeout. This option has actually existed before; it is now safe to enable. It still defaults to {{REQUEST}} (processing start time is taken as the timeout base), and the alternative is {{QUEUE}} (queue admission time is taken as the timeout base). Unfortunately, there is no consistent view of the timeout base in the community: some people think that server-side read/write timeouts are how much time _replicas_ have to respond to the coordinator. Some believe they mean how much time the _coordinator_ has to respond to the client. This patch is agnostic to these beliefs.
h3. native_transport_throw_on_overload
Whether we should apply backpressure to the client (i.e. stop reading from the socket) or throw an Overloaded exception. The default is socket backpressure, and this is probably fine for now. In principle, this can also be set by the client on a per-connection basis via protocol options. However, the 3.x series of the driver does not implement this addition, so in practice it is not really used. If used, the setting from the client takes precedence.
h3. native_transport_timeout_in_ms
The absolute maximum amount of time the server has to respond to a client request.
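Taken together, the options above would land in cassandra.yaml roughly as follows. This is a sketch only: the option names come from the comment above, the values mirror the defaults it describes, and the exact keys and accepted formats should be taken from the patch itself.
{code}
# Sketch of the options described above; values mirror the discussed defaults.
native_transport_timeout_in_ms: 12000      # absolute deadline for answering a client request
cql_start_time: REQUEST                    # timeout base; QUEUE measures from queue admission instead
native_transport_throw_on_overload: false  # false = socket backpressure, true = Overloaded exception
{code}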
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838960#comment-17838960 ] Alex Petrov commented on CASSANDRA-19534: - I guess this can explain it. We have 32 read threads, 32 write threads, and 128 native threads, so a 2:1 relation. The read queue is slightly deeper (about 80 requests), which makes sense since latency there is probably higher (though it depends on the request), and the write queue is almost empty. We can easily have all 128 requests blocked in this case, so they cannot really overload the downstream stages. Besides, there are no hints, so at least part of the issue we may have in a distributed environment is not applicable. > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19344) Range movements involving transient replicas must safely enact changes to read and write replica sets
[ https://issues.apache.org/jira/browse/CASSANDRA-19344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838910#comment-17838910 ] Alex Petrov commented on CASSANDRA-19344: - +1 > Range movements involving transient replicas must safely enact changes to > read and write replica sets > - > > Key: CASSANDRA-19344 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19344 > Project: Cassandra > Issue Type: Bug > Components: CI >Reporter: Ekaterina Dimitrova >Assignee: Sam Tunnicliffe >Priority: Normal > Fix For: 5.x > > Attachments: ci_summary-1.html, ci_summary.html, > remove-n4-post-19344.txt, remove-n4-pre-19344.txt, result_details.tar.gz > > Time Spent: 1h 40m > Remaining Estimate: 0h > > (edit) This was originally opened due to a flaky test > {{org.apache.cassandra.distributed.test.TransientRangeMovementTest.testRemoveNode-_jdk17}} > The test can fail in two different ways: > {code:java} > junit.framework.AssertionFailedError: NOT IN CURRENT: 31 -- [(00,20), > (31,50)] at > org.apache.cassandra.distributed.test.TransientRangeMovementTest.assertAllContained(TransientRangeMovementTest.java:203) > at > org.apache.cassandra.distributed.test.TransientRangeMovementTest.testRemoveNode(TransientRangeMovementTest.java:183) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43){code} > as in here - > [https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/2639/workflows/32b92ce7-5e9d-4efb-8362-d200d2414597/jobs/55139/tests#failed-test-0] > and > {code:java} > junit.framework.AssertionFailedError: nodetool command [removenode, > 6d194555-f6eb-41d0-c000-0003, --force] was not successful stdout: > stderr: error: Node /127.0.0.4:7012 is alive and owns this ID. Use > decommission command to remove it from the ring -- StackTrace -- > java.lang.UnsupportedOperationException: Node /127.0.0.4:7012 is alive and > owns this ID. Use decommission command to remove it from the ring at > org.apache.cassandra.tcm.sequences.SingleNodeSequences.removeNode(SingleNodeSequences.java:110) > at > org.apache.cassandra.service.StorageService.removeNode(StorageService.java:3682) > at org.apache.cassandra.tools.NodeProbe.removeNode(NodeProbe.java:1020) at > org.apache.cassandra.tools.nodetool.RemoveNode.execute(RemoveNode.java:51) at > org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:388) > at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:373) at > org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:272) at > org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:1129) > at > org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$51(Instance.java:1038) > at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61) at > org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71) at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.base/java.lang.Thread.run(Thread.java:833) Notifications: Error: > java.lang.UnsupportedOperationException: Node /127.0.0.4:7012 is alive and > owns this ID. 
Use decommission command to remove it from the ring at > org.apache.cassandra.tcm.sequences.SingleNodeSequences.removeNode(SingleNodeSequences.java:110) > at > org.apache.cassandra.service.StorageService.removeNode(StorageService.java:3682) > at org.apache.cassandra.tools.NodeProbe.removeNode(NodeProbe.java:1020) at > org.apache.cassandra.tools.nodetool.RemoveNode.execute(RemoveNode.java:51) at > org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:388) > at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:373) at > org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:272) at > org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:1129) > at > org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$51(Instance.java:1038) > at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61) at > org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71) at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > at >
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1783#comment-1783 ] Alex Petrov commented on CASSANDRA-19534: - Talked to [~brandon.williams] and checked the remains of the cluster in the bad state; at least the symptoms match my own observations and the issue I have seen: 180K+ tasks in the Native queue. I am a bit surprised that the read and write queues are almost empty (under 100 items in both), but depending on which node was coordinating, this can be OK. > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
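The remedy the ticket description proposes (a bounded queue with aggressive load shedding) is easy to illustrate in miniature. A generic Java toy, not Cassandra's actual native-transport stage; the pool and queue sizes are invented:
{code}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Toy model: cap both workers and queued tasks, and shed anything beyond
// that instead of letting the queue grow into the hundreds of thousands.
class SheddingExecutorSketch
{
    public static void main(String[] args)
    {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            128, 128, 60, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(1024),              // bounded queue
            new ThreadPoolExecutor.AbortPolicy());       // reject when full

        try
        {
            pool.execute(() -> { /* handle one request */ });
        }
        catch (RejectedExecutionException overloaded)
        {
            // A server would answer the client with an Overloaded error here
            // (or stop reading from its socket) rather than queueing forever.
        }
        pool.shutdown();
    }
}
{code}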
[jira] [Assigned] (CASSANDRA-19158) Reuse native transport-driven futures in Debounce
[ https://issues.apache.org/jira/browse/CASSANDRA-19158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov reassigned CASSANDRA-19158: --- Assignee: Alex Petrov > Reuse native transport-driven futures in Debounce > - > > Key: CASSANDRA-19158 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19158 > Project: Cassandra > Issue Type: Improvement >Reporter: Alex Petrov >Assignee: Alex Petrov >Priority: Normal > > Currently, we create a future in Debounce, then create one more future in > RemoteProcessor#sendWithCallback. This is further exacerbated by chaining > calls, when we first attempt to catch up from a peer, and then from the CMS. > First of all, we should only ever use the native-transport-timeout-driven > futures returned from sendWithCallback, since they implement reasonable > retries under the hood, and are easy to bulk-configure (i.e. you can simply > change the timeout in yaml and have all futures change their behaviour). > Second, we should _chain_ futures and use map or andThen for fallback > operations, such as trying to catch up from the CMS after an unsuccessful > attempt to catch up from a peer. > This should significantly simplify the code and reduce the number of > blocked/waiting threads. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
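To illustrate the chaining the ticket above describes: a minimal sketch using the JDK's {{CompletableFuture}}, with hypothetical {{catchUpFromPeer}}/{{catchUpFromCms}} methods standing in for the timeout-driven futures returned by {{sendWithCallback}}; this is not the actual Debounce code.

{code}
import java.util.concurrent.CompletableFuture;

public class DebounceChainingSketch
{
    // Stand-ins for the timeout-driven futures returned by
    // RemoteProcessor#sendWithCallback; names and signatures are hypothetical.
    static CompletableFuture<String> catchUpFromPeer()
    {
        return CompletableFuture.supplyAsync(() -> { throw new RuntimeException("peer unreachable"); });
    }

    static CompletableFuture<String> catchUpFromCms()
    {
        return CompletableFuture.supplyAsync(() -> "log state from CMS");
    }

    public static void main(String[] args)
    {
        // Chain the fallback instead of creating a second future and parking
        // a thread on the first: if catching up from the peer fails, compose
        // the CMS catch-up onto the same chain.
        CompletableFuture<String> chained =
            catchUpFromPeer()
                .handle((ok, err) -> err == null ? CompletableFuture.completedFuture(ok)
                                                 : catchUpFromCms())
                .thenCompose(f -> f);

        System.out.println(chained.join()); // prints "log state from CMS"
    }
}
{code}

Because the fallback is composed onto the chain, no thread blocks between the two attempts, and timeout/retry behaviour stays with the underlying transport futures.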
[jira] [Commented] (CASSANDRA-19514) When jvm-dtest is shutting down an instance TCM retries block the shutdown causing the test to fail
[ https://issues.apache.org/jira/browse/CASSANDRA-19514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838664#comment-17838664 ] Alex Petrov commented on CASSANDRA-19514: - +1 on the latest trunk patch! Thank you! > When jvm-dtest is shutting down an instance TCM retries block the shutdown > causing the test to fail > --- > > Key: CASSANDRA-19514 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19514 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Membership, Test/dtest/java >Reporter: David Capwell >Assignee: Sam Tunnicliffe >Priority: Normal > Fix For: 5.1 > > Attachments: ci_summary.html, result_details.tar.gz > > Time Spent: 10m > Remaining Estimate: 0h > > org.apache.cassandra.distributed.test.log.RequestCurrentEpochTest#testRequestingPeerWatermarks > {code} > java.lang.RuntimeException: java.util.concurrent.TimeoutException >org.apache.cassandra.utils.Throwables.maybeFail(Throwables.java:79) > > org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:540) > > org.apache.cassandra.distributed.impl.AbstractCluster.close(AbstractCluster.java:1098) > > org.apache.cassandra.distributed.test.log.RequestCurrentEpochTest.testRequestingPeerWatermarks(RequestCurrentEpochTest.java:77) >java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) > > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > Caused by: java.util.concurrent.TimeoutException > > org.apache.cassandra.utils.concurrent.AbstractFuture.get(AbstractFuture.java:253) > > org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:532) > Suppressed: java.util.concurrent.TimeoutException > {code} > In debugger I found the blocked future and it was > src/java/org/apache/cassandra/tcm/EpochAwareDebounce.java waiting on > src/java/org/apache/cassandra/tcm/RemoteProcessor.java retries -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838360#comment-17838360 ] Alex Petrov commented on CASSANDRA-19534: - Do you have observability data from the cluster, perchance? Would you be able to check the pending request counts for the Native, Read, and Write stages? > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
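For reference, those pending counts are visible per stage via {{nodetool tpstats}}. An abridged, illustrative example matching the shape of the question above (the figures are made up, and exact pool names vary by version):

{noformat}
$ nodetool tpstats
Pool Name                  Active  Pending  Completed  Blocked  All time blocked
Native-Transport-Requests     128   183204   91234567        0                 0
ReadStage                       4       37   12345678        0                 0
MutationStage                   8       52   23456789        0                 0
{noformat}

A very large Pending figure for Native-Transport-Requests alongside near-empty read and write stages is exactly the signature discussed in the comments here.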
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17838355#comment-17838355 ] Alex Petrov commented on CASSANDRA-19534: - I am a bit surprised to see that on 4.1 we seem to stabilize when errors begin. In essence, the problem is that request lifetime is unbounded. There are several contributing factors, such as lifetimes of local runnables, hints being re-submitted on the local mutation queue, and mutations on the replica side not respecting message expiration deadlines. I think most of these should have been present in 4.1, too. Unless, of course, there is more than one problem. I have initially discovered it pre-5.0 though. > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
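One of the contributing factors named above, mutations not respecting message expiration deadlines, comes down to checking the deadline before doing the work. A minimal, illustrative sketch of deadline-aware shedding (not the actual patch):

{code}
import java.util.concurrent.TimeUnit;

public class DeadlineAwareTask implements Runnable
{
    private final long enqueuedAtNanos = System.nanoTime();
    private final long timeoutNanos;
    private final Runnable work;

    DeadlineAwareTask(Runnable work, long timeoutMillis)
    {
        this.work = work;
        this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    }

    @Override
    public void run()
    {
        // If the task sat in the queue past its expiration deadline, shed it:
        // the coordinator has already timed out, so executing the work now
        // only adds load without helping any client.
        if (System.nanoTime() - enqueuedAtNanos > timeoutNanos)
        {
            // record a dropped-message metric here instead of executing
            return;
        }
        work.run();
    }
}
{code}

Bounding request lifetime this way keeps a backlog from turning into minutes of useless work after clients have already given up.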
[jira] [Updated] (CASSANDRA-19221) CMS: Nodes can restart with new ipaddress already defined in the cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19221: Test and Documentation Plan: Includes a test Status: Patch Available (was: Open) > CMS: Nodes can restart with new ipaddress already defined in the cluster > > > Key: CASSANDRA-19221 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19221 > Project: Cassandra > Issue Type: Bug > Components: Transactional Cluster Metadata >Reporter: Paul Chandler >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.1-alpha1 > > Attachments: ci_summary.html > > > I am simulating running a cluster in Kubernetes and testing what happens when > several pods go down and ip addresses are swapped between nodes. In 4.0 this > is blocked and the node cannot be restarted. > To simulate this I create a 3 node cluster on a local machine using 3 > loopback addresses > {code} > 127.0.0.1 > 127.0.0.2 > 127.0.0.3 > {code} > The nodes are created correctly and the first node is assigned as a CMS node > as shown: > {code} > bin/nodetool -p 7199 describecms > {code} > Cluster Metadata Service: > {code} > Members: /127.0.0.1:7000 > Is Member: true > Service State: LOCAL > {code} > At this point I bring down the nodes 127.0.0.2 and 127.0.0.3 and swap the ip > addresses for the rpc_address and listen_address > > The nodes come back as normal, but the nodeid has now been swapped against > the ip address: > Before: > {code} > Datacenter: datacenter1 > === > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- Address Load Tokens Owns (effective) Host ID > Rack > UN 127.0.0.3 75.2 KiB 16 76.0% > 6d194555-f6eb-41d0-c000-0003 rack1 > UN 127.0.0.2 86.77 KiB 16 59.3% > 6d194555-f6eb-41d0-c000-0002 rack1 > UN 127.0.0.1 80.88 KiB 16 64.7% > 6d194555-f6eb-41d0-c000-0001 rack1 > {code} > After: > {code} > Datacenter: datacenter1 > === > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- Address Load Tokens Owns (effective) Host ID > Rack > UN 127.0.0.3 149.62 KiB 16 76.0% > 6d194555-f6eb-41d0-c000-0003 rack1 > UN 127.0.0.2 155.48 KiB 16 59.3% > 6d194555-f6eb-41d0-c000-0002 rack1 > UN 127.0.0.1 75.74 KiB 16 64.7% > 6d194555-f6eb-41d0-c000-0001 rack1 > {code} > On previous tests of this I have created a table with a replication factor of > 1, inserted some data before the swap. After the swap the data on nodes 2 > and 3 is now missing. > One theory I have is that I am using different port numbers for the different > nodes, and I am only swapping the ip addresses and not the port numbers, so > the ip:port still looks unique > i.e. 127.0.0.2:9043 becomes 127.0.0.2:9044 > and 127.0.0.3:9044 becomes 127.0.0.3:9043 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19221) CMS: Nodes can restart with new ipaddress already defined in the cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19221: Attachment: ci_summary.html > CMS: Nodes can restart with new ipaddress already defined in the cluster > > > Key: CASSANDRA-19221 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19221 > Project: Cassandra > Issue Type: Bug > Components: Transactional Cluster Metadata >Reporter: Paul Chandler >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.1-alpha1 > > Attachments: ci_summary.html > > > I am simulating running a cluster in Kubernetes and testing what happens when > several pods go down and ip addresses are swapped between nodes. In 4.0 this > is blocked and the node cannot be restarted. > To simulate this I create a 3 node cluster on a local machine using 3 > loopback addresses > {code} > 127.0.0.1 > 127.0.0.2 > 127.0.0.3 > {code} > The nodes are created correctly and the first node is assigned as a CMS node > as shown: > {code} > bin/nodetool -p 7199 describecms > {code} > Cluster Metadata Service: > {code} > Members: /127.0.0.1:7000 > Is Member: true > Service State: LOCAL > {code} > At this point I bring down the nodes 127.0.0.2 and 127.0.0.3 and swap the ip > addresses for the rpc_address and listen_address > > The nodes come back as normal, but the nodeid has now been swapped against > the ip address: > Before: > {code} > Datacenter: datacenter1 > === > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- Address Load Tokens Owns (effective) Host ID > Rack > UN 127.0.0.3 75.2 KiB 16 76.0% > 6d194555-f6eb-41d0-c000-0003 rack1 > UN 127.0.0.2 86.77 KiB 16 59.3% > 6d194555-f6eb-41d0-c000-0002 rack1 > UN 127.0.0.1 80.88 KiB 16 64.7% > 6d194555-f6eb-41d0-c000-0001 rack1 > {code} > After: > {code} > Datacenter: datacenter1 > === > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- Address Load Tokens Owns (effective) Host ID > Rack > UN 127.0.0.3 149.62 KiB 16 76.0% > 6d194555-f6eb-41d0-c000-0003 rack1 > UN 127.0.0.2 155.48 KiB 16 59.3% > 6d194555-f6eb-41d0-c000-0002 rack1 > UN 127.0.0.1 75.74 KiB 16 64.7% > 6d194555-f6eb-41d0-c000-0001 rack1 > {code} > On previous tests of this I have created a table with a replication factor of > 1, inserted some data before the swap. After the swap the data on nodes 2 > and 3 is now missing. > One theory I have is that I am using different port numbers for the different > nodes, and I am only swapping the ip addresses and not the port numbers, so > the ip:port still looks unique > i.e. 127.0.0.2:9043 becomes 127.0.0.2:9044 > and 127.0.0.3:9044 becomes 127.0.0.3:9043 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19221) CMS: Nodes can restart with new ipaddress already defined in the cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837763#comment-17837763 ] Alex Petrov commented on CASSANDRA-19221: - I've had a closer look at it, and wanted to mention that the 5.0 behaviour is most likely unintended; it contains at least one bug, and is potentially dangerous. In short, my test was to spin up a 3 node cluster: {{127.0.0.1}}, {{127.0.0.2}}, {{127.0.0.3}}, and swap IP addresses for the two latter nodes ({{.2}} and {{.3}}). As a result of this test, the nodes did in fact swap their IPs, but: * if you shut down {{.2}} and {{.3}}, and start {{.2}}, and then {{.3}}, the {{.3}} startup won't even begin because ccm considers its IP address to be occupied, so the entire test only works if you start the two nodes in parallel * after swapping IP addresses, ccm breaks, since it searches for an {{UP}} message for a node's specific IP address, which it doesn't find if you merely change the address in the conf file * node {{.2}}, whose address is now {{.3}}, will still have {{.3}} in its peers table. In general, since we are using IP addresses for node identity, I am wary of allowing identity transfers for occupied pairs. By this I mean that if an {{ip <-> node id}} pair exists in the directory, we have to free up the IP address before another node can claim it. So for swapping {{.2}} and {{.3}}, one of the nodes would have to migrate to {{.4}} first, and only then can the freed-up IP address be occupied again. Submitting a patch that fixes the peers table behaviour and codifies the requirement of a separate node for swapping addresses. > CMS: Nodes can restart with new ipaddress already defined in the cluster > > > Key: CASSANDRA-19221 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19221 > Project: Cassandra > Issue Type: Bug > Components: Transactional Cluster Metadata >Reporter: Paul Chandler >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.1-alpha1 > > > I am simulating running a cluster in Kubernetes and testing what happens when > several pods go down and ip addresses are swapped between nodes. In 4.0 this > is blocked and the node cannot be restarted. 
> To simulate this I create a 3 node cluster on a local machine using 3 > loopback addresses > {code} > 127.0.0.1 > 127.0.0.2 > 127.0.0.3 > {code} > The nodes are created correctly and the first node is assigned as a CMS node > as shown: > {code} > bin/nodetool -p 7199 describecms > {code} > Cluster Metadata Service: > {code} > Members: /127.0.0.1:7000 > Is Member: true > Service State: LOCAL > {code} > At this point I bring down the nodes 127.0.0.2 and 127.0.0.3 and swap the ip > addresses for the rpc_address and listen_address > > The nodes come back as normal, but the nodeid has now been swapped against > the ip address: > Before: > {code} > Datacenter: datacenter1 > === > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- Address Load Tokens Owns (effective) Host ID > Rack > UN 127.0.0.3 75.2 KiB 16 76.0% > 6d194555-f6eb-41d0-c000-0003 rack1 > UN 127.0.0.2 86.77 KiB 16 59.3% > 6d194555-f6eb-41d0-c000-0002 rack1 > UN 127.0.0.1 80.88 KiB 16 64.7% > 6d194555-f6eb-41d0-c000-0001 rack1 > {code} > After: > {code} > Datacenter: datacenter1 > === > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- Address Load Tokens Owns (effective) Host ID > Rack > UN 127.0.0.3 149.62 KiB 16 76.0% > 6d194555-f6eb-41d0-c000-0003 rack1 > UN 127.0.0.2 155.48 KiB 16 59.3% > 6d194555-f6eb-41d0-c000-0002 rack1 > UN 127.0.0.1 75.74 KiB 16 64.7% > 6d194555-f6eb-41d0-c000-0001 rack1 > {code} > On previous tests of this I have created a table with a replication factor of > 1, inserted some data before the swap. After the swap the data on nodes 2 > and 3 is now missing. > One theory I have is that I am using different port numbers for the different > nodes, and I am only swapping the ip addresses and not the port numbers, so > the ip:port still looks unique > i.e. 127.0.0.2:9043 becomes 127.0.0.2:9044 > and 127.0.0.3:9044 becomes 127.0.0.3:9043 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
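The invariant proposed in the comment above, that an {{ip <-> node id}} pair must be released before another node can claim the address, could be codified roughly as below; the class and method names are hypothetical, not the actual TCM directory API.

{code}
import java.util.HashMap;
import java.util.Map;

public class AddressDirectorySketch
{
    // ip address -> node id; a stand-in for the TCM directory mapping.
    private final Map<String, String> boundAddresses = new HashMap<>();

    /**
     * Reject a registration while the address is still bound to a different
     * node id: the old pair has to be released (e.g. by its owner moving to
     * another address first) before the ip can be claimed again.
     */
    public synchronized void register(String ip, String nodeId)
    {
        String owner = boundAddresses.get(ip);
        if (owner != null && !owner.equals(nodeId))
            throw new IllegalStateException(ip + " is still bound to node " + owner);
        boundAddresses.put(ip, nodeId);
    }

    public synchronized void release(String ip, String nodeId)
    {
        boundAddresses.remove(ip, nodeId);
    }
}
{code}

Under this rule, swapping {{.2}} and {{.3}} directly is rejected; migrating one node through a free address such as {{.4}} first succeeds.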
[jira] [Comment Edited] (CASSANDRA-19221) CMS: Nodes can restart with new ipaddress already defined in the cluster
[ https://issues.apache.org/jira/browse/CASSANDRA-19221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837763#comment-17837763 ] Alex Petrov edited comment on CASSANDRA-19221 at 4/16/24 3:28 PM: -- I've had a closer look at it, and wanted to mention that the 5.0 behaviour is most likely unintended; it contains at least one bug, and is potentially dangerous. In short, my test was to spin up a 3 node cluster: {{127.0.0.1}}, {{127.0.0.2}}, {{127.0.0.3}}, and swap IP addresses for the two latter nodes ({{.2}} and {{.3}}). As a result of this test, the nodes did in fact swap their IPs, but: * if you shut down {{.2}} and {{.3}}, and start {{.2}}, and then {{.3}}, the {{.3}} startup won't even begin because ccm considers its IP address to be occupied, so the entire test only works if you start the two nodes in parallel * after swapping IP addresses, ccm breaks, since it searches for an {{UP}} message for a node's specific IP address, which it doesn't find if you merely change the address in the conf file * node {{.2}}, whose address is now {{.3}}, will still have {{.3}} in its peers table. In general, since we are using IP addresses for node identity, I am wary of allowing identity transfers for occupied pairs. By this I mean that if an {{ip <-> node id}} pair exists in the directory, we have to free up the IP address before another node can claim it. So for swapping {{.2}} and {{.3}}, one of the nodes would have to migrate to {{.4}} first, and only then can the freed-up IP address be occupied again. Submitting a patch that fixes the peers table behaviour and codifies the requirement of a separate node for swapping addresses. was (Author: ifesdjeen): I've had a closer look at it, and wanted to mention that 5.0 behaviour is most likely uninteded; it contains at least one bug, and is potentially dangeroud. In short, my test was to spin up a 3 node cluster: {{127.0.0.1}}, {{127.0.0.2}}, {{127.0.0.3}}, and swap IP addresses for the two latter nodes ({{.2}} and {{.3}}. As a result of this test, nodes have in fact swapped their IPs, but: * if you would shut down {{.2}} and {{.3}}, and start {{.2}}, and then {{.3}}, {{.3}} startup won't even begin because ccm considers its IP address to be occupied, so an entire test can work only if you start the two nodes in parallel * after swapping ip addresses, ccm breaks, since it attempts to search {{UP}} message for a specific IP address for a node, which it doesn't find if you merely change the address in the conf file * peers table for {{.2}} whose address is now {{.3}} will still have {{.3}} in its peers table. In general, since we are using ip addresses for node identity, I am weary of allowing identity transfers for the occupied pars. By this I mean if {{ip <-> node id}} pair exists in the directory, we have to free up the IP address before the other node can claim it. So the test would look as follows: So for swapping {{.2}} and {{.3}}, one of the nodes would have to migrate to {{.4}} first, and only then can the freed up IP address be occupied again. Submitting a patch that fixes the peers table behaviour and codifies a requirement of a separate node for swapping addresses. 
> CMS: Nodes can restart with new ipaddress already defined in the cluster > > > Key: CASSANDRA-19221 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19221 > Project: Cassandra > Issue Type: Bug > Components: Transactional Cluster Metadata >Reporter: Paul Chandler >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.1-alpha1 > > > I am simulating running a cluster in Kubernetes and testing what happens when > several pods go down and ip addresses are swapped between nodes. In 4.0 this > is blocked and the node cannot be restarted. > To simulate this I create a 3 node cluster on a local machine using 3 > loopback addresses > {code} > 127.0.0.1 > 127.0.0.2 > 127.0.0.3 > {code} > The nodes are created correctly and the first node is assigned as a CMS node > as shown: > {code} > bin/nodetool -p 7199 describecms > {code} > Cluster Metadata Service: > {code} > Members: /127.0.0.1:7000 > Is Member: true > Service State: LOCAL > {code} > At this point I bring down the nodes 127.0.0.2 and 127.0.0.3 and swap the ip > addresses for the rpc_address and listen_address > > The nodes come back as normal, but the nodeid has now been swapped against > the ip address: > Before: > {code} > Datacenter: datacenter1 > === > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- Address Load Tokens Owns (effective) Host ID > Rack > UN 127.0.0.3 75.2 KiB
[jira] [Updated] (CASSANDRA-19128) The result of applying a metadata snapshot via ForceSnapshot should return the correct set of modified keys
[ https://issues.apache.org/jira/browse/CASSANDRA-19128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19128: Source Control Link: https://github.com/apache/cassandra/commit/7623e4678b8ef131434f1de3522c6425c092dff9 Resolution: Fixed Status: Resolved (was: Ready to Commit) > The result of applying a metadata snapshot via ForceSnapshot should return > the correct set of modified keys > --- > > Key: CASSANDRA-19128 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19128 > Project: Cassandra > Issue Type: Improvement > Components: Cluster/Membership >Reporter: Marcus Eriksson >Assignee: Alex Petrov >Priority: High > Fix For: 5.1-alpha1 > > Attachments: ci_summary-1.html, ci_summary.html > > Time Spent: 50m > Remaining Estimate: 0h > > It should use the same logic as Transformer::build to compare the updated CM > with the previous to derive the modified keys -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
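The fix summarized above, comparing the updated cluster metadata with the previous one to derive the modified keys (as Transformer::build does), is essentially a map diff. A schematic sketch, with generic maps standing in for the real metadata structures:

{code}
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

public class ModifiedKeysSketch
{
    /**
     * Returns the keys whose values differ between the previous and updated
     * metadata: changed, added, and removed entries all count as modified.
     */
    static <K, V> Set<K> modifiedKeys(Map<K, V> previous, Map<K, V> updated)
    {
        Set<K> all = new HashSet<>(previous.keySet());
        all.addAll(updated.keySet());

        Set<K> modified = new HashSet<>();
        for (K key : all)
            if (!Objects.equals(previous.get(key), updated.get(key)))
                modified.add(key);
        return modified;
    }
}
{code}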
[jira] [Updated] (CASSANDRA-19128) The result of applying a metadata snapshot via ForceSnapshot should return the correct set of modified keys
[ https://issues.apache.org/jira/browse/CASSANDRA-19128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19128: Reviewers: Marcus Eriksson Status: Review In Progress (was: Patch Available) > The result of applying a metadata snapshot via ForceSnapshot should return > the correct set of modified keys > --- > > Key: CASSANDRA-19128 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19128 > Project: Cassandra > Issue Type: Improvement > Components: Cluster/Membership >Reporter: Marcus Eriksson >Assignee: Alex Petrov >Priority: High > Fix For: 5.1-alpha1 > > Attachments: ci_summary-1.html, ci_summary.html > > Time Spent: 50m > Remaining Estimate: 0h > > It should use the same logic as Transformer::build to compare the updated CM > with the previous to derive the modified keys -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19128) The result of applying a metadata snapshot via ForceSnapshot should return the correct set of modified keys
[ https://issues.apache.org/jira/browse/CASSANDRA-19128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837615#comment-17837615 ] Alex Petrov commented on CASSANDRA-19128: - [~marcuse] left his +1 on the pull request. > The result of applying a metadata snapshot via ForceSnapshot should return > the correct set of modified keys > --- > > Key: CASSANDRA-19128 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19128 > Project: Cassandra > Issue Type: Improvement > Components: Cluster/Membership >Reporter: Marcus Eriksson >Assignee: Alex Petrov >Priority: High > Fix For: 5.1-alpha1 > > Attachments: ci_summary-1.html, ci_summary.html > > Time Spent: 50m > Remaining Estimate: 0h > > It should use the same logic as Transformer::build to compare the updated CM > with the previous to derive the modified keys -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-19128) The result of applying a metadata snapshot via ForceSnapshot should return the correct set of modified keys
[ https://issues.apache.org/jira/browse/CASSANDRA-19128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19128: Status: Ready to Commit (was: Review In Progress) > The result of applying a metadata snapshot via ForceSnapshot should return > the correct set of modified keys > --- > > Key: CASSANDRA-19128 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19128 > Project: Cassandra > Issue Type: Improvement > Components: Cluster/Membership >Reporter: Marcus Eriksson >Assignee: Alex Petrov >Priority: High > Fix For: 5.1-alpha1 > > Attachments: ci_summary-1.html, ci_summary.html > > Time Spent: 50m > Remaining Estimate: 0h > > It should use the same logic as Transformer::build to compare the updated CM > with the previous to derive the modified keys -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-19344) Range movements involving transient replicas must safely enact changes to read and write replica sets
[ https://issues.apache.org/jira/browse/CASSANDRA-19344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837573#comment-17837573 ] Alex Petrov edited comment on CASSANDRA-19344 at 4/16/24 8:04 AM: -- Wanted to point out a somewhat unintuitive albeit correct behaviour that involves Transient Replicas. I think it is worth talking through such things because pending ranges with transient replicas work slightly differently from their "normal" counterparts. We have a four-node cluster with nodes 1,2,3,4 owning tokens 100,200,300,400, and 4 moving from 400 to 350. Original/start state (READ/WRITE placements): {code} (400,MIN] -> [Full(/127.0.0.1:7012,(400,MIN]), Full(/127.0.0.2:7012,(400,MIN]), Transient(/127.0.0.3:7012,(400,MIN])]} (MIN,100] -> [Full(/127.0.0.1:7012,(MIN,100]), Full(/127.0.0.2:7012,(MIN,100]), Transient(/127.0.0.3:7012,(MIN,100])]} (100,200] -> [Full(/127.0.0.2:7012,(100,200]), Full(/127.0.0.3:7012,(100,200]), Transient(/127.0.0.4:7012,(100,200])]} (200,300] -> [Full(/127.0.0.3:7012,(200,300]), Full(/127.0.0.4:7012,(200,300]), Transient(/127.0.0.1:7012,(200,300])]} (300,350] -> [Full(/127.0.0.1:7012,(300,350]), Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]} (350,400] -> [Full(/127.0.0.4:7012,(350,400]), Full(/127.0.0.1:7012,(350,400]), Transient(/127.0.0.2:7012,(350,400])]} {code} State after {{START_MOVE}} (which is the point at which streaming starts, so think of additional replicas as pending), for WRITE placements: {code} (400,MIN] -> [Full(/127.0.0.1:7012,(400,MIN]), Full(/127.0.0.2:7012,(400,MIN]), Full(/127.0.0.3:7012,(400,MIN])]} (MIN,100] -> [Full(/127.0.0.1:7012,(MIN,100]), Full(/127.0.0.2:7012,(MIN,100]), Full(/127.0.0.3:7012,(MIN,100])]} (100,200] -> [Full(/127.0.0.2:7012,(100,200]), Full(/127.0.0.3:7012,(100,200]), Transient(/127.0.0.4:7012,(100,200]), Transient(/127.0.0.1:7012,(100,200])]} (200,300] -> [Full(/127.0.0.3:7012,(200,300]), Full(/127.0.0.4:7012,(200,300]), Full(/127.0.0.1:7012,(200,300])]} (300,350] -> [Full(/127.0.0.1:7012,(300,350]), Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]} (350,400] -> [Full(/127.0.0.4:7012,(350,400]), Full(/127.0.0.1:7012,(350,400]), Full(/127.0.0.2:7012,(350,400]), Transient(/127.0.0.3:7012,(350,400])]} {code} READ placements at the same moment: {code} (400,MIN] -> [Transient(/127.0.0.1:7012,(400,MIN]), Full(/127.0.0.2:7012,(400,MIN]), Full(/127.0.0.3:7012,(400,MIN])]} (MIN,100] -> [Transient(/127.0.0.1:7012,(MIN,100]), Full(/127.0.0.2:7012,(MIN,100]), Full(/127.0.0.3:7012,(MIN,100])]} (100,200] -> [Full(/127.0.0.2:7012,(100,200]), Full(/127.0.0.3:7012,(100,200]), Transient(/127.0.0.1:7012,(100,200])]} (200,300] -> [Full(/127.0.0.3:7012,(200,300]), Full(/127.0.0.1:7012,(200,300]), Transient(/127.0.0.4:7012,(200,300])]} (300,350] -> [Full(/127.0.0.1:7012,(300,350]), Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]} (350,400] -> [Full(/127.0.0.4:7012,(350,400]), Full(/127.0.0.2:7012,(350,400]), Transient(/127.0.0.3:7012,(350,400])]} {code} Please note that READ placements are always a subset of WRITE ones (or, well, in a way: we can technically read from full to satisfy a transient read). After FINISH_MOVE, we get for both READ and WRITE: {code} (400,MIN] -> [Full(/127.0.0.2:7012,(400,MIN]), Full(/127.0.0.3:7012,(400,MIN]), Transient(/127.0.0.1:7012,(400,MIN])]} (MIN,200] -> [Full(/127.0.0.2:7012,(MIN,200]), Transient(/127.0.0.1:7012,(MIN,200]), Full(/127.0.0.3:7012,(MIN,200])]} (200,300] -> [Full(/127.0.0.3:7012,(200,300]), 
Full(/127.0.0.1:7012,(200,300]), Transient(/127.0.0.4:7012,(200,300])]} (300,350] -> [Full(/127.0.0.1:7012,(300,350]), Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]} (350,400] -> [Full(/127.0.0.4:7012,(350,400]), Full(/127.0.0.2:7012,(350,400]), Transient(/127.0.0.3:7012,(350,400])]} {code} After executing START_MOVE, we get 3 full and no transient nodes for {{(200,300]}}. If we put the transitions together, we see: {code} 1. (200,300] -> [Full(/127.0.0.3:7012,(200,300]), Full(/127.0.0.4:7012,(200,300]), Transient(/127.0.0.1:7012,(200,300])]} 2. (200,300] -> [Full(/127.0.0.3:7012,(200,300]), Full(/127.0.0.4:7012,(200,300]), Full(/127.0.0.1:7012,(200,300])]} 3. (200,300] -> [Full(/127.0.0.3:7012,(200,300]), Full(/127.0.0.1:7012,(200,300]), Transient(/127.0.0.4:7012,(200,300])]} {code} In {{2.}}, you see that {{127.0.0.1}} went from transient to full, since it is now gaining a range, and should be a target for pending writes for this range. At the same time, it remains a _transient read replica_. In {{3.}}, {{127.0.0.4}} went from full to transient; it was kept full up till now since it was a streaming source, and to keep consistency levels correct, we What is unintuitive here
[jira] [Commented] (CASSANDRA-19344) Range movements involving transient replicas must safely enact changes to read and write replica sets
[ https://issues.apache.org/jira/browse/CASSANDRA-19344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837573#comment-17837573 ] Alex Petrov commented on CASSANDRA-19344: - Wanted to point out a somewhat unintuitive albeit correct behaviour that involves Transient Replicas. I think it is worth talking through such things because pending ranges with transient replicas work slightly differently from their "normal" counterparts. We have a four-node cluster with nodes 1,2,3,4 owning tokens 100,200,300,400, and 4 moving from 400 to 350. Original/start state (READ/WRITE placements): {code} (400,MIN] -> [Full(/127.0.0.1:7012,(400,MIN]), Full(/127.0.0.2:7012,(400,MIN]), Transient(/127.0.0.3:7012,(400,MIN])]} (MIN,100] -> [Full(/127.0.0.1:7012,(MIN,100]), Full(/127.0.0.2:7012,(MIN,100]), Transient(/127.0.0.3:7012,(MIN,100])]} (100,200] -> [Full(/127.0.0.2:7012,(100,200]), Full(/127.0.0.3:7012,(100,200]), Transient(/127.0.0.4:7012,(100,200])]} (200,300] -> [Full(/127.0.0.3:7012,(200,300]), Full(/127.0.0.4:7012,(200,300]), Transient(/127.0.0.1:7012,(200,300])]} (300,350] -> [Full(/127.0.0.1:7012,(300,350]), Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]} (350,400] -> [Full(/127.0.0.4:7012,(350,400]), Full(/127.0.0.1:7012,(350,400]), Transient(/127.0.0.2:7012,(350,400])]} {code} State after {{START_MOVE}} (which is the point at which streaming starts, so think of additional replicas as pending), for WRITE placements: {code} (400,MIN] -> [Full(/127.0.0.1:7012,(400,MIN]), Full(/127.0.0.2:7012,(400,MIN]), Full(/127.0.0.3:7012,(400,MIN])]} (MIN,100] -> [Full(/127.0.0.1:7012,(MIN,100]), Full(/127.0.0.2:7012,(MIN,100]), Full(/127.0.0.3:7012,(MIN,100])]} (100,200] -> [Full(/127.0.0.2:7012,(100,200]), Full(/127.0.0.3:7012,(100,200]), Transient(/127.0.0.4:7012,(100,200]), Transient(/127.0.0.1:7012,(100,200])]} (200,300] -> [Full(/127.0.0.3:7012,(200,300]), Full(/127.0.0.4:7012,(200,300]), Full(/127.0.0.1:7012,(200,300])]} (300,350] -> [Full(/127.0.0.1:7012,(300,350]), Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]} (350,400] -> [Full(/127.0.0.4:7012,(350,400]), Full(/127.0.0.1:7012,(350,400]), Full(/127.0.0.2:7012,(350,400]), Transient(/127.0.0.3:7012,(350,400])]} {code} READ placements at the same moment: {code} (400,MIN] -> [Transient(/127.0.0.1:7012,(400,MIN]), Full(/127.0.0.2:7012,(400,MIN]), Full(/127.0.0.3:7012,(400,MIN])]} (MIN,100] -> [Transient(/127.0.0.1:7012,(MIN,100]), Full(/127.0.0.2:7012,(MIN,100]), Full(/127.0.0.3:7012,(MIN,100])]} (100,200] -> [Full(/127.0.0.2:7012,(100,200]), Full(/127.0.0.3:7012,(100,200]), Transient(/127.0.0.1:7012,(100,200])]} (200,300] -> [Full(/127.0.0.3:7012,(200,300]), Full(/127.0.0.1:7012,(200,300]), Transient(/127.0.0.4:7012,(200,300])]} (300,350] -> [Full(/127.0.0.1:7012,(300,350]), Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]} (350,400] -> [Full(/127.0.0.4:7012,(350,400]), Full(/127.0.0.2:7012,(350,400]), Transient(/127.0.0.3:7012,(350,400])]} {code} Please note that READ placements are always a subset of WRITE ones (or, well, in a way: we can technically read from full to satisfy a transient read). After FINISH_MOVE, we get for both READ and WRITE: {code} (400,MIN] -> [Full(/127.0.0.2:7012,(400,MIN]), Full(/127.0.0.3:7012,(400,MIN]), Transient(/127.0.0.1:7012,(400,MIN])]} (MIN,200] -> [Full(/127.0.0.2:7012,(MIN,200]), Transient(/127.0.0.1:7012,(MIN,200]), Full(/127.0.0.3:7012,(MIN,200])]} (200,300] -> [Full(/127.0.0.3:7012,(200,300]), Full(/127.0.0.1:7012,(200,300]), 
Transient(/127.0.0.4:7012,(200,300])]} (300,350] -> [Full(/127.0.0.1:7012,(300,350]), Transient(/127.0.0.2:7012,(300,350]), Full(/127.0.0.4:7012,(300,350])]} (350,400] -> [Full(/127.0.0.4:7012,(350,400]), Full(/127.0.0.2:7012,(350,400]), Transient(/127.0.0.3:7012,(350,400])]} {code} After executing START_MOVE, we get 3 full and no transient nodes for {{(200,300]}}. If we put the transitions together, we see: {code} 1. (200,300] -> [Full(/127.0.0.3:7012,(200,300]), Full(/127.0.0.4:7012,(200,300]), Transient(/127.0.0.1:7012,(200,300])]} 2. (200,300] -> [Full(/127.0.0.3:7012,(200,300]), Full(/127.0.0.4:7012,(200,300]), Full(/127.0.0.1:7012,(200,300])]} 3. (200,300] -> [Full(/127.0.0.3:7012,(200,300]), Full(/127.0.0.1:7012,(200,300]), Transient(/127.0.0.4:7012,(200,300])]} {code} In {{2.}}, you see that {{127.0.0.1}} went from transient to full, since it is now gaining a range, and should be a target for pending writes for this range. At the same time, it remains a _transient read replica_. In {{3.}}, {{127.0.0.4}} went from full to transient; it was kept full up till now since it was a streaming source, and to keep consistency levels correct, we What is unintuitive here is that usually, with replication factor of 3, we
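The property called out in the comments above, that READ placements remain a subset of WRITE placements throughout the transition (a full replica can always serve a transient read), can be stated as an explicit check. A sketch with plain strings standing in for ranges and endpoints, not the actual placement types:

{code}
import java.util.Map;
import java.util.Set;

public class PlacementInvariantSketch
{
    /**
     * For every range, each read replica must also be a write replica.
     * Since a Full write replica can serve a Transient read, only endpoint
     * membership is checked here, not the Full/Transient status.
     */
    static boolean readsAreSubsetOfWrites(Map<String, Set<String>> readPlacements,
                                          Map<String, Set<String>> writePlacements)
    {
        for (Map.Entry<String, Set<String>> entry : readPlacements.entrySet())
        {
            Set<String> writeReplicas = writePlacements.get(entry.getKey());
            if (writeReplicas == null || !writeReplicas.containsAll(entry.getValue()))
                return false;
        }
        return true;
    }
}
{code}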
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837419#comment-17837419 ] Alex Petrov commented on CASSANDRA-19534: - Sounds good, I'll tag you as soon as I have it up. Thank you [~rustyrazorblade]! > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-19534) unbounded queues in native transport requests lead to node instability
[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837416#comment-17837416 ] Alex Petrov commented on CASSANDRA-19534: - [~rustyrazorblade] oh yes, that would exist on a single node as well. Think of a single node as the RF=1 case, where coordinator and replica are colocated. I have just finished the last wrinkle in my patch; now I just need to rebase, and I hope to post it ASAP. Hope it's not pressing, but unless you already have a patch for this, the quickest way is probably to check out what I have, as what you describe should be well covered. > unbounded queues in native transport requests lead to node instability > -- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths >Reporter: Jon Haddad >Assignee: Alex Petrov >Priority: Normal > Fix For: 5.0-rc, 5.x > > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=10 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-12937) Default setting (yaml) for SSTable compression
[ https://issues.apache.org/jira/browse/CASSANDRA-12937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837237#comment-17837237 ] Alex Petrov commented on CASSANDRA-12937: - bq. Yes, I think this is the most ideal solution. If somebody wants to experiment with a new compressor and similar, there would need to be some knob to override it, like some JMX method or similar, and all risks attached to that (divergence of the configuration caused by operator's negligence) would be on him. Some things are actually quite useful for gradual rollout. For example, compression. You probably do not want to rewrite your sstables across the entire cluster. Similar arguments may be made for canary deployments of memtable settings and other things. I agree that it is fine if these parameters are completely transient (i.e. if you have set it to something that diverges from the clusterwide value, it will get reverted back after the node bounce). In that case, probably they will not go through TCM and will be purely node-local. Examples of things that are now configurable via yaml but will be configurable via TCM if we go ahead with this proposal: partitioner, memtable configuration, default compaction strategy, compression. As Sam has mentioned, "which specific value makes it into schema just depends on which instance acts as the coordinator for a given DCL statement". bq. but I remain unconvinced that just picking the defaults from whatever node happens to be coordinating is the right way to go. I talked with Sam briefly just to make sure I understand it correctly before trying to describe it. Since this was first worded in a way that suggested a problem but did not directly propose a solution (possibly described elsewhere), I will attempt to do so. Sam has already described a part of the solution as: bq. That should probably be in a parallel local datastructure though, not in the node's local log table as we don't want to ship those local defaults to peers when providing log catchup (because they should use their own defaults). The part that was missing for me was where the values would be coming from, and what the precedence would be. When executing a {{CREATE}} statement on some node _without_ specifying, say, compression, the statement will be created and executed without the value for compression set at all. Every node will pick the value from the ephemeral parallel structure Sam described (which is also settable via JMX and the like, as Stefan mentioned). If no value is present in this structure, it will be picked from yaml (alternatively, we could just populate this structure from yaml, too, but I consider these things roughly equivalent). > Default setting (yaml) for SSTable compression > -- > > Key: CASSANDRA-12937 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12937 > Project: Cassandra > Issue Type: Improvement > Components: Local/Config >Reporter: Michael Semb Wever >Assignee: Stefan Miklosovic >Priority: Low > Labels: AdventCalendar2021 > Fix For: 5.x > > Time Spent: 8h > Remaining Estimate: 0h > > In many situations the choice of compression for sstables is more relevant to > the disks attached than to the schema and data. > This issue is to add to cassandra.yaml a default value for sstable > compression that new tables will inherit (instead of the defaults found in > {{CompressionParams.DEFAULT}}). 
> Examples where this can be relevant are filesystems that do on-the-fly > compression (btrfs, zfs) or specific disk configurations or even specific C* > versions (see CASSANDRA-10995 ). > +Additional information for newcomers+ > Some new fields need to be added to {{cassandra.yaml}} to allow specifying > the default compression parameters. In > {{DatabaseDescriptor}} a new {{CompressionParams}} field should be added for > the default compression. This field should be initialized in > {{DatabaseDescriptor.applySimpleConfig()}}. At the different places where > {{CompressionParams.DEFAULT}} was used, the code should call > {{DatabaseDescriptor#getDefaultCompressionParams}}, which should return a > copy of the configured {{CompressionParams}}. > A unit test using {{OverrideConfigurationLoader}} should verify > that the table schema uses the new default when a new table is created (see > CreateTest for an example). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
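The newcomer notes above outline the shape of the change; a self-contained sketch of that flow follows, with every name below illustrative rather than the committed Cassandra API.

{code}
// Minimal stand-ins for the real classes named in the ticket.
class CompressionParams
{
    static final CompressionParams DEFAULT = new CompressionParams("LZ4Compressor");
    final String sstableCompressor;

    CompressionParams(String compressor) { this.sstableCompressor = compressor; }

    CompressionParams copy() { return new CompressionParams(sstableCompressor); }
}

class Config
{
    // Hypothetical yaml-backed field, e.g. sstable_compression: ZstdCompressor
    String sstableCompression;
}

public class DatabaseDescriptorSketch
{
    private static CompressionParams defaultCompressionParams;

    static void applySimpleConfig(Config conf)
    {
        // New tables inherit this default instead of CompressionParams.DEFAULT.
        defaultCompressionParams = conf.sstableCompression != null
                                   ? new CompressionParams(conf.sstableCompression)
                                   : CompressionParams.DEFAULT;
    }

    // Hand out a copy so callers cannot mutate the shared default.
    static CompressionParams getDefaultCompressionParams()
    {
        return defaultCompressionParams.copy();
    }
}
{code}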
[jira] [Comment Edited] (CASSANDRA-12937) Default setting (yaml) for SSTable compression
[ https://issues.apache.org/jira/browse/CASSANDRA-12937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837237#comment-17837237 ] Alex Petrov edited comment on CASSANDRA-12937 at 4/15/24 1:08 PM: -- bq. Yes, I think this is the most ideal solution. If somebody wants to experiment with a new compressor and similar, there would need to be some knob to override it, like some JMX method or similar, and all risks attached to that (divergence of the configuration caused by operator's negligence) would be on him. Some things are actually quite useful for gradual rollout. For example, compression. You probably do not want to rewrite your sstables across the entire cluster. Similar arguments may be made for canary deployments of memtable settings and other things. I agree that it is fine if these parameters are completely transient (i.e. if you have set it to something that diverges from the clusterwide value, it will get reverted back after the node bounce). In that case, probably they will not go through TCM and will be purely node-local. Examples of things that are now configurable via yaml but will be configurable via TCM if we go ahead with this proposal: partitioner, memtable configuration, default compaction strategy, compression. As Sam has mentioned, "which specific value makes it into schema just depends on which instance acts as the coordinator for a given DCL statement". bq. but I remain unconvinced that just picking the defaults from whatever node happens to be coordinating is the right way to go. I talked with Sam briefly just to make sure I understand it correctly before trying to describe it. Since this was first worded in a way that suggested a problem but did not directly propose a solution (possibly described elsewhere), I will attempt to do so. Sam has already described a part of the solution as: bq. That should probably be in a parallel local datastructure though, not in the node's local log table as we don't want to ship those local defaults to peers when providing log catchup (because they should use their own defaults). The part that was missing for me was where the values would be coming from, and what the precedence would be. When executing a {{CREATE}} statement on some node _without_ specifying, say, compression, the statement will be created and executed without the value for compression set at all. Every node will pick the value from the ephemeral parallel structure Sam described (which is also settable via JMX and the like, as Stefan mentioned). If no value is present in this structure, it will be picked from yaml (alternatively, we could just populate this structure from yaml, too, but I consider these things roughly equivalent). was (Author: ifesdjeen): bq. Yes, I think this is the most ideal solution. If somebody wants to experiment with a new compressor and similar, there would need to be some knob to override it, like some JMX method or similar, and all risks attached to that (divergence of the configuration caused by operator's negligence) would be on him. Some things are actually quite useful for gradual rollout. For example, compression. You probably do not want to rewrite your sstables across the entire cluster. Similar arguments may be made for canary deployments of memtable settings and other things. I agree that it is fine if these parameters are completely transient (i.e. if you have set it to something that diverges from the clusterwide value, it will get reverted back after the node bounce). 
In such case, probably they will not go through TCM and will be purely node-local. Examples of things that are now configuable via yaml but will be configurable via TCM if we go ahead with this proposal: partitioner, memtable configuration, default compaction strategy, compression. As Sam has mentioned, "which specific value makes it into schema just depends on which instance acts as the coordinator for a given DCL statement". bq. but I remain unconvinced that just picking the defaults from whatever node happens to be coordinating is the right way to go. I have talked with Sam shortly just to make sure I understand it correctly before trying to describe it. Since this was first worded in a way that suggested a problem but not directly proposed a solution (possibly described elsewhere), I will attempt to do this. Sam has already described a part of the solution as: bq. That should probably be in a parallel local datastructure though, not in the node's local log table as we don't want to ship those local defaults to peers when providing log catchup (because they should use their own defaults). The part that was missing for me was where would the values be coming from, and what would be the precedence. When executing a {CREATE} statement on some node _without_ specifying, say, compression, the statement will be created and
[jira] [Updated] (CASSANDRA-19517) Raise priority of TCM internode messages during critical operations
[ https://issues.apache.org/jira/browse/CASSANDRA-19517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19517: Test and Documentation Plan: Includes tests. Additional stress testing will be done during release qualification. Status: Patch Available (was: Open) > Raise priority of TCM internode messages during critical operations > --- > > Key: CASSANDRA-19517 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19517 > Project: Cassandra > Issue Type: Improvement > Components: Transactional Cluster Metadata >Reporter: Alex Petrov >Assignee: Alex Petrov >Priority: Normal > Attachments: ci_summary.html, result_details.tar.gz > > > In a busy cluster, TCM messages may not get propagated throughout the > cluster, since they will be ordered together with other P1 messages (for > {{TCM_}} prefixed verbs), and with P2 with all Paxos operations. > To avoid this, and make sure we can continue cluster metadata changes, all > {{TCM_}}-prefixed verbs should have {{P0}} priority, just like Gossip > messages used to. All Paxos messages that involve distributed metadata > keyspace should now get an {{URGENT}} flag, which will instruct internode > messaging to schedule them on the {{URGENT_MESSAGES}} connection. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
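The prioritisation the ticket above describes boils down to a mapping from verb name to message priority, plus an urgency flag for metadata-keyspace Paxos traffic. A schematic sketch, not the actual Verb table:

{code}
public class VerbPrioritySketch
{
    enum Priority { P0, P1, P2 }

    /**
     * TCM_-prefixed verbs are promoted to P0, like Gossip messages used to
     * be; all other verbs keep their regular priority. Paxos messages that
     * touch the distributed metadata keyspace would additionally carry an
     * URGENT flag so internode messaging routes them over the
     * URGENT_MESSAGES connection.
     */
    static Priority priorityOf(String verbName, Priority regular)
    {
        return verbName.startsWith("TCM_") ? Priority.P0 : regular;
    }
}
{code}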
[jira] [Updated] (CASSANDRA-19517) Raise priority of TCM internode messages during critical operations
[ https://issues.apache.org/jira/browse/CASSANDRA-19517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-19517: Attachment: result_details.tar.gz > Raise priority of TCM internode messages during critical operations > --- > > Key: CASSANDRA-19517 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19517 > Project: Cassandra > Issue Type: Improvement > Components: Transactional Cluster Metadata >Reporter: Alex Petrov >Assignee: Alex Petrov >Priority: Normal > Attachments: ci_summary.html, result_details.tar.gz > > > In a busy cluster, TCM messages may not get propagated throughout the > cluster, since they will be ordered together with other P1 messages (for > {{TCM_}} prefixed verbs), and with P2 with all Paxos operations. > To avoid this, and make sure we can continue cluster metadata changes, all > {{TCM_}}-prefixed verbs should have {{P0}} priority, just like Gossip > messages used to. All Paxos messages that involve distributed metadata > keyspace should now get an {{URGENT}} flag, which will instruct internode > messaging to schedule them on the {{URGENT_MESSAGES}} connection. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org