[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840058#comment-17840058 ]
Alex Petrov commented on CASSANDRA-19534:
-----------------------------------------

The main change is the introduction of a (currently implicit) configurable {_}native request deadline{_}. No request, read or write, is allowed to prolong its execution beyond this deadline. Some of the hidden places that previously let requests stay overdue were local executor runnables, replica-side writes, and hints. The default is 12 seconds, since this is how long the 3.x driver (which I believe is still the most used version in the community) waits before removing its handlers, after which any response from the server is simply ignored.

There is now an _option_ to enable expiration based on queue time, which is _disabled_ by default to preserve existing semantics, but my tests have shown that enabling it only has positive effects. We will try it out cautiously in different clusters over the next months and see whether the tests match up with real loads before we change any of the defaults.

By default, behaviour will be as follows:
# If a request has spent more than 12 seconds in the NATIVE queue, we throw an Overloaded exception back to the client. This timeout used to be the max of the read/write/range/counter rpc timeouts.
# If a request has spent less than 12 seconds, it is allowed to execute; any request issued by the coordinator can live:
## _either_ {{Verb.timeout}} milliseconds,
## _or_ up to the native request deadline, measured from the time the request was admitted to the coordinator's NATIVE queue, whichever of these happens earlier.

Example 1, read timeout is 5 seconds:
# Client sends a request; the request spends 6 seconds in the NATIVE queue
# Coordinator issues requests to replicas; two replicas respond within 3 seconds
# Coordinator responds to the client with success

Example 2, read timeout is 5 seconds:
# Client sends a request; the request spends 6 seconds in the NATIVE queue
# Coordinator issues requests to replicas; one replica responds within 3 seconds; the other replicas fail to respond within the 5-second read timeout
# Coordinator responds to the client with a read timeout (preserves current behaviour)

Example 3, read timeout is 5 seconds:
# Client sends a request; the request spends 10 seconds in the NATIVE queue
# Coordinator issues requests to replicas; all replicas fail to respond within 2 seconds
# Coordinator responds to the client with a read timeout; if messages are still queued on the replicas, they are dropped before processing

There will be a _new_ metric that shows how many of the timeouts would previously have been "blind timeouts", i.e. the client _would_ register them as timeouts, but we as server-side operators would be oblivious to them. This metric will keep us collectively motivated even if we see a slight uptick in timeouts after committing the patch.

Lastly, there is an option to say how much of the 12 seconds client requests are allowed to spend in the native queue. You can say that if a client request has spent 80% of its maximum 12 seconds in the native queue, we start applying backpressure to the client socket (or throwing an Overloaded exception, depending on the value of {{native_transport_throw_on_overload}}). We have to be careful with enabling this one, since my tests have shown that while we see fewer timeouts server side, clients see more timeouts, because part of the time they consider "request time" is now spent somewhere in TCP queues, which we cannot account for.
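To make the interplay between the verb timeout and the native request deadline concrete, here is a minimal sketch of the "whichever happens earlier" rule from the examples above. The class, constants and method names are illustrative stand-ins, not the actual patch code:

{code:java}
import java.util.concurrent.TimeUnit;

// Illustrative only: shows the "whichever comes first" rule described above.
final class RequestDeadlines
{
    // Hypothetical stand-in for native_transport_timeout_in_ms (12 seconds in the examples above)
    static final long NATIVE_REQUEST_DEADLINE_NANOS = TimeUnit.SECONDS.toNanos(12);

    /**
     * @param queueAdmissionNanos nanoTime when the request entered the NATIVE queue
     * @param nowNanos            current nanoTime
     * @return true if the request should be shed with an Overloaded exception before execution
     */
    static boolean expiredInQueue(long queueAdmissionNanos, long nowNanos)
    {
        return nowNanos - queueAdmissionNanos > NATIVE_REQUEST_DEADLINE_NANOS;
    }

    /**
     * Deadline for a replica request issued by the coordinator: either the verb timeout
     * (e.g. 5 seconds for reads) or the remaining native request deadline, whichever is earlier.
     */
    static long replicaRequestDeadlineNanos(long queueAdmissionNanos, long requestStartNanos, long verbTimeoutNanos)
    {
        long nativeDeadline = queueAdmissionNanos + NATIVE_REQUEST_DEADLINE_NANOS;
        long verbDeadline = requestStartNanos + verbTimeoutNanos;
        return Math.min(nativeDeadline, verbDeadline);
    }
}
{code}

In Example 1, only 6 of the 12 seconds are gone when processing starts, so the 5-second read timeout is the earlier bound; in Example 3, only 2 seconds of deadline remain, so replica messages still sitting in queues past that point are dropped before processing.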
h3. New Configuration Params

h3. cql_start_time

Configures what is considered to be the base for the replica-side timeout. This option actually existed before; it is now safe to enable. It still defaults to {{REQUEST}} (processing start time is taken as the timeout base), and the alternative is {{QUEUE}} (queue admission time is taken as the timeout base). Unfortunately, there is no consistent view of the timeout base in the community: some people think that server-side read/write timeouts are how much time _replicas_ have to respond to the coordinator; others believe they mean how much time the _coordinator_ has to respond to the client. This patch is agnostic to these beliefs.

h3. native_transport_throw_on_overload

Whether we should apply backpressure to the client (i.e. stop reading from the socket) or throw an Overloaded exception. The default is socket backpressure, and this is probably fine for now. In principle, this can also be set by the client on a per-connection basis via protocol options. However, the 3.x series of the driver does not implement this addition, so in practice it is not really used. If used, the setting from the client takes precedence.

h3. native_transport_timeout_in_ms

The absolute maximum amount of time the server has to respond to the client. No work related to a client request will be done after that period elapses. The default is 100 seconds, which is unreasonably high, but not unbounded. In practice, we should use at most 12 seconds.

h3. native_transport_queue_max_item_age_threshold

Works in conjunction with {{native_transport_timeout_in_ms}}. How much of that maximum time the oldest request in the native queue is allowed to spend in the queue before we start applying backpressure (or throwing an Overloaded exception, depending on what {{native_transport_throw_on_overload}} is set to). The default is "all the time needed". We should set it to 0.5-0.8 (50-80% of the 12 seconds), but I would leave it disabled at least until there's more testing done, since it hides the time a request spent in TCP queues before it got to us.

h3. native_transport_\{min|max}_backoff_on_queue_overload_ms

If we start applying backpressure (see {{native_transport_queue_max_item_age_threshold}}), these control the minimum and maximum backoff. If the queue fills up above {{native_transport_queue_max_item_age_threshold}}, in other words a request has been sitting in the queue for more than that fraction of {{native_transport_timeout_in_ms}}, we start an incident. An incident starts by marking the incident time and raising the severity level to 1. Each time we observe an old item at the head of the queue, we first bump the number of times we have applied backpressure; after bumping it 10 times, we increase the severity level. The backpressure delay applied to the client socket is computed by multiplying the severity level by the minimum delay. If we have not seen old requests at the head of the queue for 1 second, we close the incident. If the queue remains saturated for a prolonged period, the amount of delay will increase in proportion to the request rate, as appliedTimes and severityLevel are incremented. If no new requests are considered overloaded in this way for a second, the incident is reset and the delay drops back down to the minimum delay.
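As a rough sketch of the incident bookkeeping described above (the class, field names and constants are hypothetical; the real patch may structure this differently), the delay could be derived roughly like this:

{code:java}
// Illustrative sketch of the incident-based backoff described above.
final class QueueOverloadIncident
{
    static final long MIN_BACKOFF_MS = 10;    // stand-in for native_transport_min_backoff_on_queue_overload_ms
    static final long MAX_BACKOFF_MS = 1000;  // stand-in for native_transport_max_backoff_on_queue_overload_ms
    static final long QUIET_PERIOD_MS = 1000; // incident closes after 1 second without overloaded requests

    private long lastOverloadMs = 0;
    private int appliedTimes = 0;
    private int severityLevel = 0;

    /** Called whenever an over-age item is observed at the head of the NATIVE queue; returns the delay to apply. */
    long onOverloadedRequest(long nowMs)
    {
        boolean newIncident = severityLevel == 0 || nowMs - lastOverloadMs > QUIET_PERIOD_MS;
        if (newIncident)
        {
            // previous incident (if any) has gone quiet: start a fresh one at severity 1
            appliedTimes = 0;
            severityLevel = 1;
        }
        lastOverloadMs = nowMs;

        appliedTimes++;
        if (appliedTimes % 10 == 0)
            severityLevel++;

        // delay applied to the client socket: severity level times the minimum delay, capped at the maximum
        return Math.min(severityLevel * MIN_BACKOFF_MS, MAX_BACKOFF_MS);
    }
}
{code}

The multiplication by severity level is what makes the delay grow while the queue stays saturated, and the quiet-period reset is what lets it drop back to the minimum once a burst passes.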
h2. Newly Introduced Metrics

* {{ClientMetrics/ConnectionPaused}}: the number of times a client connection was paused due to backpressure.
* {{ThreadPoolMetrics/OldestTaskQueueTime}}: the age (in milliseconds) of the oldest task in the given queue.
* {{ClientMetrics/Queued}}: for how long (in nanoseconds) an item was queued before processing.

h2. Testing and Examples

Scenario 1: constant overload
{code:java}
           |  Stock | QUEUE | QUEUE + backpressure
success    |    595 | 56179 |                21540
timedOut   | 101180 | 45647 |                80310
overloaded |     66 |     0 |                    0
{code}

Scenario 2: sudden burst, followed by a drop
{code:java}
           | Stock | QUEUE | QUEUE + backpressure
success    | 46305 | 41087 |                46021
timedOut   |  5016 | 10121 |                 5204
overloaded |     0 |  5204 |                    0
{code}

How to interpret the results: an overloaded cluster eventually falls into a state where it does not serve any good traffic at all (in real life it would continue serving some traffic, but clients would experience a substantial number of timeouts). It is very busy and spinning on all gears, and all queues are full. However, because the timeout base is "when we began processing the request" (disregarding the potentially long queue time), we serve (almost) no good traffic. Above, you can see that the number of successes in the Stock column is very small. This is the number of requests we are able to complete successfully after adding 50ms of artificial latency in the READ verb handler.

During overload, we now instead shed bad (timed-out) requests and do not attempt to serve them. In addition, we apply backpressure to the clients to push them back a little. While in a cluster under constant overload pressure this is just "the best we can do", in real life such events are rare and short, so all we need to do is prevent a sudden short-lived burst from lingering long enough for the cluster to start serving less good traffic.

You can see that backpressure does not add much on top of QUEUE: server-side metrics will be slightly better, but client-side we will keep timing out, since the client does not know whether the request was paused because of server-side queue delay or because it was buffered in a TCP queue. We still believe that we need client backpressure (or load shedding), since in the case of a sudden burst of requests that eventually tapers off, backpressure does have a positive effect.

There may be a _slight_ change in metrics after this change, but the added time will be constant and very small, maybe not even significant enough for operators to notice. Previously, {{nanoTime}} for metric calculation was grabbed _in the middle of the request_ on the coordinator side. Now, we will track the entire coordinated path. But this is just a few method calls on both the read and write paths.
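For illustration, a queue-time measurement of the kind backing {{ClientMetrics/Queued}} (with the queue-admission timestamp reused as the base for the coordinated path) could look roughly like the following; the metric wiring and names here are hypothetical, not the actual patch code:

{code:java}
import java.util.concurrent.TimeUnit;

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

// Illustrative only: record how long an item waited in the NATIVE queue before processing.
final class QueuedTimeRecorder
{
    private final Timer queued = new MetricRegistry().timer("Queued");

    /** Capture nanoTime once, at NATIVE queue admission; reuse it as the timing base for the whole request. */
    long onEnqueue()
    {
        return System.nanoTime();
    }

    /** At dequeue, record how long the item waited before processing, in nanoseconds. */
    void onDequeue(long enqueuedAtNanos)
    {
        queued.update(System.nanoTime() - enqueuedAtNanos, TimeUnit.NANOSECONDS);
    }
}
{code}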
> unbounded queues in native transport requests lead to node instability
> -----------------------------------------------------------------------
>
>                 Key: CASSANDRA-19534
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Jon Haddad
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 5.0-rc, 5.x
>
> When a node is under pressure, hundreds of thousands of requests can show up
> in the native transport queue, and it looks like it can take way longer to
> timeout than is configured. We should be shedding load much more
> aggressively and use a bounded queue for incoming work. This is extremely
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100 -r 1 --workload.rows=100000 --workload.select=partition --maxrlat 100 --populate 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100 -r 1 --workload.rows=100000 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
> {noformat}
>  Writes                               | Reads                                | Deletes                              | Errors
>   Count  Latency (p99)  1min (req/s)  |  Count  Latency (p99)  1min (req/s)  |  Count  Latency (p99)  1min (req/s)  |   Count  1min (errors/s)
>  950286       70403.93        634.77  | 789524       70442.07        426.02  |      0              0             0  | 9580484         18980.45
>  952304       70567.62         640.1  | 791072       70634.34        428.36  |      0              0             0  | 9636658         18969.54
>  953146       70767.34         640.1  | 791400       70767.76        428.36  |      0              0             0  | 9695272         18969.54
>  956833       71171.28        623.14  | 794009        71175.6        412.79  |      0              0             0  | 9749377         19002.44
>  959627       71312.58        656.93  | 795703       71349.87        435.56  |      0              0             0  | 9804907         18943.11{noformat}
> After stopping the load test altogether, it took nearly a minute before the
> requests were no longer queued.