[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843376#comment-17843376 ]
Jon Haddad commented on CASSANDRA-19534:
----------------------------------------

I'm running these two workloads concurrently:

{noformat}
easy-cass-stress run RandomPartitionAccess --rate 50k -r .5 -d 10h --workload.rows=100000 --workload.select=partition
easy-cass-stress run KeyValue --rate 50k -r .5 -d 24h -p 10m --populate 100m
{noformat}

In this screenshot, the top node is running the branch; the other two are running 5.0-HEAD. The first node has completed more native transport requests and has a significantly less backed-up queue:

!screenshot-1.png!

The cluster has reached a point where it's failing a ton, so I've stopped the workload to see how fast it recovers. The cassandra0 node with the branch recovered almost immediately. The other nodes took approximately 10 seconds.

I restarted the above two workloads and added a third:

{noformat}
easy-cass-stress run KeyValue --keyspace test1 --field.keyvalue.value='random(1024,2048)' -p 1m -r .5 --populate 1m
{noformat}

The mix of expensive and cheap reads is an easy way to create a deep queue for NTR. It wasn't long before I got to this:

!screenshot-2.png!

It looks like load is being shed much faster off cassandra0:

!screenshot-3.png!

Within 10 seconds the first node had fully recovered; it took about 10 additional seconds for the other two nodes to recover as well.

!screenshot-4.png!

I've rerun this several times now and am consistently finding [~ifesdjeen]'s patched version recovers quicker. The boxes are all running at 99+% CPU, and cassandra0 each time continues to complete more requests, maintain a shallower queue, and recover first.

!screenshot-5.png!

Starting my test with all 3 nodes running the patch.
> unbounded queues in native transport requests lead to node instability
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-19534
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Jon Haddad
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 4.1.x, 5.0-rc, 5.x
>
>         Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html, screenshot-1.png, screenshot-2.png, screenshot-3.png, screenshot-4.png, screenshot-5.png
>
>          Time Spent: 20m
>   Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up in the native transport queue, and it looks like it can take way longer to time out than is configured. We should be shedding load much more aggressively and use a bounded queue for incoming work.
> This is extremely evident when we combine a resource-consuming workload with a smaller one.
>
> Running 5.0 HEAD on a single node as of today:
>
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100 -r 1 --workload.rows=100000 --workload.select=partition --maxrlat 100 --populate 10m --rate 50k -n 1
>
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100 -r 1 --workload.rows=100000 --workload.select=partition --rate 200 -d 1d
>
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h
> {noformat}
>
> It appears our results don't time out at the requested server time either:
>
> {noformat}
>          Writes                              Reads                               Deletes                             Errors
>  Count    Latency (p99)  1min (req/s) | Count    Latency (p99)  1min (req/s) | Count  Latency (p99)  1min (req/s) | Count    1min (errors/s)
>  950286   70403.93       634.77       | 789524   70442.07       426.02       | 0      0              0            | 9580484  18980.45
>  952304   70567.62       640.1        | 791072   70634.34       428.36       | 0      0              0            | 9636658  18969.54
>  953146   70767.34       640.1        | 791400   70767.76       428.36       | 0      0              0            | 9695272  18969.54
>  956833   71171.28       623.14       | 794009   71175.6        412.79       | 0      0              0            | 9749377  19002.44
>  959627   71312.58       656.93       | 795703   71349.87       435.56       | 0      0              0            | 9804907  18943.11
> {noformat}
>
> After stopping the load test altogether, it took nearly a minute before the requests were no longer queued.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
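The core idea in the description — replace an unbounded incoming-work queue with a bounded one and shed requests that don't fit — can be sketched in a few lines of Java. This is an illustrative toy, not Cassandra's actual native-transport code; the class {{BoundedQueueShed}}, the {{CAPACITY}} constant, and the {{accepted}} helper are hypothetical names for the demo:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch only: shows how a bounded queue sheds load once full, in contrast
// to an unbounded queue that accepts everything and backs up. Not the
// actual Cassandra NTR implementation.
public class BoundedQueueShed {
    static final int CAPACITY = 4; // hypothetical small bound for the demo

    // Try to enqueue `requests` items; return how many were accepted.
    public static int accepted(BlockingQueue<Integer> queue, int requests) {
        int ok = 0;
        for (int i = 0; i < requests; i++) {
            // offer() returns false immediately when the queue is full,
            // instead of blocking; a server would reject the request with
            // an overloaded-style error at this point rather than queue it.
            if (queue.offer(i)) {
                ok++;
            }
        }
        return ok;
    }

    public static void main(String[] args) {
        BlockingQueue<Integer> q = new ArrayBlockingQueue<>(CAPACITY);
        int ok = accepted(q, 10);
        // Only CAPACITY requests fit; the remaining 6 are shed immediately.
        System.out.println("accepted=" + ok + " shed=" + (10 - ok));
    }
}
```

The point of the bound is that rejection happens at admission time, so a client sees an error (and can back off) within milliseconds, instead of its request sitting in a deep queue until it times out long after the configured limit — which matches the slow-recovery behavior described above.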