[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alex Petrov updated CASSANDRA-19534: ------------------------------------ Since Version: 4.1.5 Source Control Link: https://github.com/apache/cassandra/commit/dc17c29724d86547538cc8116ff1a90d36a0bf3a Resolution: Fixed Status: Resolved (was: Ready to Commit) Committed to 4.1 with [dc17c29724d86547538cc8116ff1a90d36a0bf3a|https://github.com/apache/cassandra/commit/dc17c29724d86547538cc8116ff1a90d36a0bf3a] and merged up to [5.0|https://github.com/apache/cassandra/commit/617a75843c9bfaf241249514f9604466f6c8ccab] and [trunk|https://github.com/apache/cassandra/commit/d10008d54bfb301ba12d022037b1caf78f18418b]. > Unbounded queues in native transport requests lead to node instability > ---------------------------------------------------------------------- > > Key: CASSANDRA-19534 > URL: https://issues.apache.org/jira/browse/CASSANDRA-19534 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Local Write-Read Paths > Reporter: Jon Haddad > Assignee: Alex Petrov > Priority: Normal > Fix For: 4.1.x, 5.0-rc, 5.x > > Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - > QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, > Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary-4.1.html, > ci_summary-5.0.html, ci_summary-trunk.html, ci_summary.html, > image-2024-05-03-16-08-10-101.png, screenshot-1.png, screenshot-2.png, > screenshot-3.png, screenshot-4.png, screenshot-5.png, screenshot-6.png, > screenshot-7.png, screenshot-8.png, screenshot-9.png > > Time Spent: 9h 50m > Remaining Estimate: 0h > > When a node is under pressure, hundreds of thousands of requests can show up > in the native transport queue, and it looks like it can take way longer to > timeout than is configured. We should be shedding load much more > aggressively and use a bounded queue for incoming work. This is extremely > evident when we combine a resource consuming workload with a smaller one: > Running 5.0 HEAD on a single node as of today: > {noformat} > # populate only > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=100000 --workload.select=partition --maxrlat 100 --populate > 10m --rate 50k -n 1 > # workload 1 - larger reads > easy-cass-stress run RandomPartitionAccess -p 100 -r 1 > --workload.rows=100000 --workload.select=partition --rate 200 -d 1d > # second workload - small reads > easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat} > It appears our results don't time out at the requested server time either: > > {noformat} > Writes Reads > Deletes Errors > Count Latency (p99) 1min (req/s) | Count Latency (p99) 1min (req/s) | > Count Latency (p99) 1min (req/s) | Count 1min (errors/s) > 950286 70403.93 634.77 | 789524 70442.07 426.02 | > 0 0 0 | 9580484 18980.45 > 952304 70567.62 640.1 | 791072 70634.34 428.36 | > 0 0 0 | 9636658 18969.54 > 953146 70767.34 640.1 | 791400 70767.76 428.36 | > 0 0 0 | 9695272 18969.54 > 956833 71171.28 623.14 | 794009 71175.6 412.79 | > 0 0 0 | 9749377 19002.44 > 959627 71312.58 656.93 | 795703 71349.87 435.56 | > 0 0 0 | 9804907 18943.11{noformat} > > After stopping the load test altogether, it took nearly a minute before the > requests were no longer queued. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org