[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842164#comment-17842164 ]

Brandon Williams commented on CASSANDRA-19534:
----------------------------------------------

bq. I'd suggest setting cql_start_time to REQUEST

This appears to be the default in the patch, so first I ran with no config 
changes. Here are the KeyValue easy-cass-stress (ECS) numbers while the 
RandomPartitionAccess workload is also running, with its rate increased to 300:
{noformat}
                 Writes                                  Reads                                  Deletes                       Errors
  Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |      Count  1min (errors/s)
1035374      100254.13             0 |  828223       100252.6             0 |       0              0             0 |   91260303          20005.4
1035374      100254.13             0 |  828223       100252.6             0 |       0              0             0 |   91320989         20007.02
1035374      100254.13             0 |  828223       100252.6             0 |       0              0             0 |   91380356         20007.02
1035374      100254.13             0 |  828223       100252.6             0 |       0              0             0 |   91441015         19976.79
{noformat}

We can see the 100s native transport timeout default reflected in the stable 
~100,000 ms p99 latency, and with the ECS rate set to 20k/s the node is doing 
nothing but throwing errors at this point.  There was also a good amount of GC 
pressure.
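
For reference, here is a minimal cassandra.yaml sketch of the two settings 
exercised in these runs. The key names follow the ones used in this thread 
and may differ from the final patch; the semantics in the comments are my 
reading of it:
{noformat}
# Hard upper bound on how long a native transport (CQL) request may live
# before it is shed; 100s is the patch default seen in the run above,
# and the runs below lower it to 12s.
native_transport_timeout: 100s

# Reference point the deadline is measured from (assumed semantics):
#   REQUEST - clock starts when the request begins executing (patch default)
#   QUEUE   - clock starts when the request is first enqueued
cql_start_time: REQUEST
{noformat}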

With the native transport timeout adjusted to 12s:
{noformat}
                 Writes                                  Reads                                  Deletes                       Errors
  Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
6362953       12019.36       7602.56 | 6346212       12016.37       7581.98 |       0              0             0 | 1639458          4976.36
6384650       12016.84        7566.8 | 6367878       12023.32       7553.07 |       0              0             0 | 1655989          5033.01
6405461       12016.84        7566.8 | 6388707       12023.32       7553.07 |       0              0             0 | 1674127          5033.01
6426641       12016.84       7510.02 | 6409624       12021.76        7493.9 |       0              0             0 | 1693822          5158.58
{noformat}

We can see the timeout reflected again in the ~12,000 ms p99, but this time, 
without heap pressure, the node continues to serve many requests.

Finally, here is cql_start_time set to QUEUE and the native transport timeout 
at 12s:
{noformat}
                 Writes                                  Reads                                  Deletes                       Errors
  Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
 505121       11983.81         53.36 |  794926        6334.45        113.39 |       0              0             0 | 5350041          19782.8
 505123       11983.81         49.13 |  794926        6334.45        104.33 |       0              0             0 | 5410428         19815.76
 505137       11983.81         49.13 |  794926        6334.45        104.33 |       0              0             0 | 5468740         19815.76
 505145       11983.81         45.53 |  794926        6334.45         95.99 |       0              0             0 | 5528104         19848.02
{noformat}

This also ended up throwing errors but still respected the timeout.
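
For clarity, the delta for this final run relative to the sketch above (same 
assumed key names) is just the start-time reference:
{noformat}
native_transport_timeout: 12s
cql_start_time: QUEUE   # measure the deadline from enqueue time (assumed semantics)
{noformat}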

This patch appears to solve the runaway latency growth, as requests never live 
beyond the native transport timeout.  I still think the 100s default is too 
high: it is the closest to the previous unbounded behavior, but it is still 
detrimental and probably not what most people actually want, especially since 
it may exert additional GC pressure.

> unbounded queues in native transport requests lead to node instability
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-19534
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Local Write-Read Paths
>            Reporter: Jon Haddad
>            Assignee: Alex Petrov
>            Priority: Normal
>             Fix For: 5.0-rc, 5.x
>
>         Attachments: Scenario 1 - QUEUE + Backpressure.jpg, Scenario 1 - QUEUE.jpg, Scenario 1 - Stock.jpg, Scenario 2 - QUEUE + Backpressure.jpg, Scenario 2 - QUEUE.jpg, Scenario 2 - Stock.jpg, ci_summary.html
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a node is under pressure, hundreds of thousands of requests can show up 
> in the native transport queue, and it looks like it can take way longer to 
> timeout than is configured.  We should be shedding load much more 
> aggressively and use a bounded queue for incoming work.  This is extremely 
> evident when we combine a resource consuming workload with a smaller one:
> Running 5.0 HEAD on a single node as of today:
> {noformat}
> # populate only
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=100000 --workload.select=partition --maxrlat 100 --populate 10m --rate 50k -n 1
> # workload 1 - larger reads
> easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=100000 --workload.select=partition --rate 200 -d 1d
> # second workload - small reads
> easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
> It appears our results don't time out at the requested server time either:
>  
> {noformat}
>                  Writes                                  Reads                                   Deletes                       Errors
>   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
>  950286       70403.93        634.77 |  789524       70442.07        426.02 |       0              0             0 | 9580484         18980.45
>  952304       70567.62         640.1 |  791072       70634.34        428.36 |       0              0             0 | 9636658         18969.54
>  953146       70767.34         640.1 |  791400       70767.76        428.36 |       0              0             0 | 9695272         18969.54
>  956833       71171.28        623.14 |  794009        71175.6        412.79 |       0              0             0 | 9749377         19002.44
>  959627       71312.58        656.93 |  795703       71349.87        435.56 |       0              0             0 | 9804907         18943.11{noformat}
>  
> After stopping the load test altogether, it took nearly a minute before the 
> requests were no longer queued.


