[jira] [Commented] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

2024-05-31 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851018#comment-17851018
 ] 

Alex Petrov commented on CASSANDRA-19215:
-

This should be fixed by [CASSANDRA-19534].

> "Query start time" in native transport request threads should be the task 
> enqueue time
> --
>
> Key: CASSANDRA-19215
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
> Project: Cassandra
> Issue Type: Bug
> Components: Messaging/Client
> Reporter: Runtian Liu
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: ci_summary.html, result_details.tar.gz
>
>
> Recently, our Cassandra 4.0.6 cluster experienced an outage due to a surge in 
> expensive traffic from the application side. This surge involved a large 
> volume of costly read queries, which took a considerable amount of time to 
> process on the server side. The client had timeout settings; if a request 
> timed out, it might trigger the sending of new requests. Since the server 
> nodes were overloaded, numerous nodes had hundreds of thousands of tasks 
> queued in the Native-Transport-Request pending queue. I expected that once 
> the application ceased sending requests, the server node would quickly return 
> to normal, as most requests in the queue were over half an hour old and 
> should have timed out rapidly, clearing the queue. However, it actually took 
> an hour to clear the native transport's pending queue, even with native 
> transport disabled. Upon examining the code, I noticed that for read/write 
> requests, the 
> [queryStartNanoTime|https://github.com/apache/cassandra/blob/cassandra-4.0/src/java/org/apache/cassandra/transport/Dispatcher.java#L78],
>  which determines if a request has timed out, only begins when the task 
> starts processing. This means that no matter how long a request has been 
> pending, it doesn't contribute to the timeout. I believe this is incorrect. 
> The timer should start when the Cassandra server receives the request or when 
> it enqueues the task, not when the request/task begins processing. This way, 
> an overloaded node with many pending tasks can quickly discard timed-out 
> requests and recover from an outage once new requests stop.
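
The difference described above can be illustrated with a minimal Java sketch. It is not Cassandra's Dispatcher code: the class EnqueueTimeoutSketch, the QueuedRequest type, and the drain and sampleQueue helpers are hypothetical names, and the 5-second timeout is an assumed value. For simplicity the whole queue is drained at a single instant. The sketch only shows that a timer started at enqueue time lets stale requests be discarded without any work, while a timer started at dequeue never counts the time spent pending.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; these names are hypothetical and not Cassandra's classes.
class EnqueueTimeoutSketch {
    /** Stand-in for a request waiting in the pending queue. */
    static final class QueuedRequest {
        final long enqueuedAtNanos; // captured when the task is enqueued

        QueuedRequest(long enqueuedAtNanos) {
            this.enqueuedAtNanos = enqueuedAtNanos;
        }
    }

    /** True if more than timeoutNanos elapsed between startNanos and nowNanos. */
    static boolean isExpired(long startNanos, long nowNanos, long timeoutNanos) {
        return nowNanos - startNanos > timeoutNanos;
    }

    /**
     * Drains the queue at the instant nowNanos and returns how many requests
     * were discarded as timed out. When startAtEnqueue is false the timer
     * starts at dequeue, so the time spent pending never counts.
     */
    static int drain(Queue<QueuedRequest> queue, long nowNanos, long timeoutNanos, boolean startAtEnqueue) {
        int discarded = 0;
        QueuedRequest request;
        while ((request = queue.poll()) != null) {
            long start = startAtEnqueue ? request.enqueuedAtNanos : nowNanos;
            if (isExpired(start, nowNanos, timeoutNanos))
                discarded++; // skipped without doing any work
            // otherwise the request would be executed here
        }
        return discarded;
    }

    /** Two requests that have been pending for over half an hour, and one fresh request. */
    static Queue<QueuedRequest> sampleQueue(long nowNanos) {
        Queue<QueuedRequest> queue = new ArrayDeque<>();
        queue.add(new QueuedRequest(nowNanos - TimeUnit.MINUTES.toNanos(35)));
        queue.add(new QueuedRequest(nowNanos - TimeUnit.MINUTES.toNanos(30)));
        queue.add(new QueuedRequest(nowNanos - TimeUnit.SECONDS.toNanos(1)));
        return queue;
    }

    public static void main(String[] args) {
        long now = TimeUnit.MINUTES.toNanos(40);
        long timeout = TimeUnit.SECONDS.toNanos(5); // assumed timeout for the example
        System.out.println("timer from enqueue: discarded " + drain(sampleQueue(now), now, timeout, true));
        System.out.println("timer from dequeue: discarded " + drain(sampleQueue(now), now, timeout, false));
    }
}
```

With the timer started at enqueue, the two stale requests are dropped and only the fresh one would be executed; with the timer started at dequeue, all three would be executed regardless of how long they had been pending.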



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

2024-05-07 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17844258#comment-17844258
 ] 

Alex Petrov commented on CASSANDRA-19215:
-

This is now largely superseded by work on [CASSANDRA-19534], as I have posted 
the patch there.

> "Query start time" in native transport request threads should be the task 
> enqueue time
> --
>
> Key: CASSANDRA-19215
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
> Project: Cassandra
> Issue Type: Bug
> Components: Messaging/Client
> Reporter: Runtian Liu
> Assignee: Alex Petrov
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: ci_summary.html, result_details.tar.gz






[jira] [Commented] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

2024-03-19 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828423#comment-17828423
 ] 

Alex Petrov commented on CASSANDRA-19215:
-

[~curlylrt] we've determined that this change might be insufficient and have 
worked on a larger change that might be a better fit. I hope to submit a 
reworked version this week.

> "Query start time" in native transport request threads should be the task 
> enqueue time
> --
>
> Key: CASSANDRA-19215
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
> Project: Cassandra
> Issue Type: Bug
> Components: Messaging/Client
> Reporter: Runtian Liu
> Assignee: Alex Petrov
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Attachments: ci_summary.html, result_details.tar.gz






[jira] [Commented] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

2024-01-22 Thread Runtian Liu (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809674#comment-17809674
 ] 

Runtian Liu commented on CASSANDRA-19215:
-

[~ifesdjeen] any update on this one? Happy to help review if you have the patch 
ready. Thanks.

> "Query start time" in native transport request threads should be the task 
> enqueue time
> --
>
> Key: CASSANDRA-19215
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
> Project: Cassandra
> Issue Type: Bug
> Components: Messaging/Client
> Reporter: Runtian Liu
> Assignee: Alex Petrov
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x






[jira] [Commented] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

2024-01-08 Thread Alex Petrov (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804225#comment-17804225
 ] 

Alex Petrov commented on CASSANDRA-19215:
-

I have a patch pending for this; I will run it through CI and submit it as soon 
as possible.

> "Query start time" in native transport request threads should be the task 
> enqueue time
> --
>
> Key: CASSANDRA-19215
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
> Project: Cassandra
> Issue Type: Bug
> Components: Messaging/Client
> Reporter: Runtian Liu
> Assignee: Alex Petrov
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x






[jira] [Commented] (CASSANDRA-19215) "Query start time" in native transport request threads should be the task enqueue time

2024-01-02 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801775#comment-17801775
 ] 

Brandon Williams commented on CASSANDRA-19215:
--

[~samt] can you take a look?

> "Query start time" in native transport request threads should be the task 
> enqueue time
> --
>
> Key: CASSANDRA-19215
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19215
> Project: Cassandra
> Issue Type: Bug
> Components: Messaging/Client
> Reporter: Runtian Liu
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x


