[ https://issues.apache.org/jira/browse/CASSANDRA-19215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828423#comment-17828423 ]
Alex Petrov commented on CASSANDRA-19215: ----------------------------------------- [~curlylrt] we've determined that this change might be insufficient and have worked on a larger change that might be a better fit. I hope to submit a reworked version this week. > "Query start time" in native transport request threads should be the task > enqueue time > -------------------------------------------------------------------------------------- > > Key: CASSANDRA-19215 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19215 > Project: Cassandra > Issue Type: Bug > Components: Messaging/Client > Reporter: Runtian Liu > Assignee: Alex Petrov > Priority: Normal > Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x > > Attachments: ci_summary.html, result_details.tar.gz > > > Recently, our Cassandra 4.0.6 cluster experienced an outage due to a surge in > expensive traffic from the application side. This surge involved a large > volume of costly read queries, which took a considerable amount of time to > process on the server side. The client had timeout settings; if a request > timed out, it might trigger the sending of new requests. Since the server > nodes were overloaded, numerous nodes had hundreds of thousands of tasks > queued in the Native-Transport-Request pending queue. I expected that once > the application ceased sending requests, the server node would quickly return > to normal, as most requests in the queue were over half an hour old and > should have timed out rapidly, clearing the queue. However, it actually took > an hour to clear the native transport's pending queue, even with native > transport disabled. Upon examining the code, I noticed that for read/write > requests, the > [queryStartNanoTime|https://github.com/apache/cassandra/blob/cassandra-4.0/src/java/org/apache/cassandra/transport/Dispatcher.java#L78], > which determines if a request has timed out, only begins when the task > starts processing. This means that no matter how long a request has been > pending, it doesn't contribute to the timeout. I believe this is incorrect. > The timer should start when the Cassandra server receives the request or when > it enqueues the task, not when the request/task begins processing. This way, > an overloaded node with many pending tasks can quickly discard timed-out > requests and recover from an outage once new requests stop. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org