[jira] [Commented] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

[ https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851018#comment-17851018 ]

Alex Petrov commented on CASSANDRA-19215:
-----------------------------------------

This should be fixed by [CASSANDRA-19534].

> "Query start time" in native transport request threads should be the task
> enqueue time
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19215
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Runtian Liu
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>         Attachments: ci_summary.html, result_details.tar.gz
>
>
> Recently, our Cassandra 4.0.6 cluster experienced an outage due to a surge in
> expensive traffic from the application side. This surge involved a large
> volume of costly read queries, which took a considerable amount of time to
> process on the server side. The client had timeout settings; if a request
> timed out, it might trigger the sending of new requests. Since the server
> nodes were overloaded, numerous nodes had hundreds of thousands of tasks
> queued in the Native-Transport-Request pending queue. I expected that once
> the application ceased sending requests, the server node would quickly return
> to normal, as most requests in the queue were over half an hour old and
> should have timed out rapidly, clearing the queue. However, it actually took
> an hour to clear the native transport's pending queue, even with native
> transport disabled. Upon examining the code, I noticed that for read/write
> requests, the
> [queryStartNanoTime|https://github.com/apache/cassandra/blob/cassandra-4.0/src/java/org/apache/cassandra/transport/Dispatcher.java#L78],
> which determines if a request has timed out, only begins when the task
> starts processing. This means that no matter how long a request has been
> pending, it doesn't contribute to the timeout. I believe this is incorrect.
> The timer should start when the Cassandra server receives the request or when
> it enqueues the task, not when the request/task begins processing. This way,
> an overloaded node with many pending tasks can quickly discard timed-out
> requests and recover from an outage once new requests stop.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

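
The timing behaviour described in the issue above can be illustrated with a minimal Java sketch. This is not Cassandra code: `EnqueueTimeSketch`, `PendingRequest`, `drain`, and `sampleQueue` are hypothetical names, and the 5-second timeout and the queue contents are made-up values. It only contrasts starting the timeout clock at dequeue (the behaviour the reporter describes for `queryStartNanoTime`) with starting it at enqueue (the behaviour the reporter proposes).

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names exist in Cassandra.
final class EnqueueTimeSketch {

    // A pending request that remembers when it entered the queue.
    static final class PendingRequest {
        final long enqueuedAtNanos;

        PendingRequest(long enqueuedAtNanos) {
            this.enqueuedAtNanos = enqueuedAtNanos;
        }
    }

    // Drains the queue at time nowNanos and returns how many requests would
    // still be executed instead of being discarded as timed out.
    static int drain(Queue<PendingRequest> queue, long nowNanos,
                     long timeoutNanos, boolean startClockAtEnqueue) {
        int executed = 0;
        PendingRequest request;
        while ((request = queue.poll()) != null) {
            // Starting the clock at dequeue hides the time spent waiting in
            // the queue; starting it at enqueue counts that time.
            long startNanos = startClockAtEnqueue ? request.enqueuedAtNanos : nowNanos;
            if (nowNanos - startNanos > timeoutNanos) {
                continue; // already timed out: discard without doing the work
            }
            executed++;
        }
        return executed;
    }

    // Assumed sample data: two requests queued 30 minutes ago, one queued 1 second ago.
    static Queue<PendingRequest> sampleQueue(long nowNanos) {
        Queue<PendingRequest> queue = new ArrayDeque<>();
        queue.add(new PendingRequest(nowNanos - TimeUnit.MINUTES.toNanos(30)));
        queue.add(new PendingRequest(nowNanos - TimeUnit.MINUTES.toNanos(30)));
        queue.add(new PendingRequest(nowNanos - TimeUnit.SECONDS.toNanos(1)));
        return queue;
    }

    public static void main(String[] args) {
        long now = TimeUnit.MINUTES.toNanos(60);    // fixed "current time" for determinism
        long timeout = TimeUnit.SECONDS.toNanos(5); // assumed request timeout
        System.out.println(drain(sampleQueue(now), now, timeout, false)); // clock starts at dequeue
        System.out.println(drain(sampleQueue(now), now, timeout, true));  // clock starts at enqueue
    }
}
```

With these sample values the first call executes all three requests, including the two that waited half an hour, while the second executes only the request queued one second ago. That is the difference the reporter expects to let an overloaded node clear its pending queue quickly.
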
[jira] [Commented] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

[ https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844258#comment-17844258 ]

Alex Petrov commented on CASSANDRA-19215:
-----------------------------------------

This is now largely superseded by work on [CASSANDRA-19534], as I have posted the patch there.

> "Query start time" in native transport request threads should be the task
> enqueue time
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19215
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Runtian Liu
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>         Attachments: ci_summary.html, result_details.tar.gz

[jira] [Commented] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

[ https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828423#comment-17828423 ]

Alex Petrov commented on CASSANDRA-19215:
-----------------------------------------

[~curlylrt] we've determined that this change might be insufficient and have worked on a larger change that might be a better fit. I hope to submit a reworked version this week.

> "Query start time" in native transport request threads should be the task
> enqueue time
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19215
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Runtian Liu
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>         Attachments: ci_summary.html, result_details.tar.gz

[jira] [Commented] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

[ https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809674#comment-17809674 ]

Runtian Liu commented on CASSANDRA-19215:
-----------------------------------------

[~ifesdjeen] any update on this one? Happy to help review if you have the patch ready. Thanks.

> "Query start time" in native transport request threads should be the task
> enqueue time
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19215
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Runtian Liu
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x

[jira] [Commented] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

[ https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804225#comment-17804225 ]

Alex Petrov commented on CASSANDRA-19215:
-----------------------------------------

I have a patch for this pending; I will run it through CI and submit it ASAP.

> "Query start time" in native transport request threads should be the task
> enqueue time
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19215
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Runtian Liu
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x

[jira] [Commented] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

[ https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801775#comment-17801775 ]

Brandon Williams commented on CASSANDRA-19215:
----------------------------------------------

[~samt] can you take a look?

> "Query start time" in native transport request threads should be the task
> enqueue time
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19215
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Runtian Liu
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x