[ https://issues.apache.org/jira/browse/CASSANDRA-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jon Haddad updated CASSANDRA-19534:
-----------------------------------
    Description: 
When a node is under pressure, hundreds of thousands of requests can show up in 
the native transport queue, and they appear to take far longer to time out than 
the configured timeout.  We should shed load much more aggressively and use a 
bounded queue for incoming work (a rough sketch of what that could look like 
follows the results below).  This is especially evident when we combine a 
resource-consuming workload with a smaller one:

Running 5.0 HEAD on a single node as of today:
{noformat}
# populate only
easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=100000 --workload.select=partition --maxrlat 100 --populate 10m --rate 50k -n 1

# workload 1 - larger reads
easy-cass-stress run RandomPartitionAccess -p 100  -r 1 --workload.rows=100000 --workload.select=partition --rate 200 -d 1d

# second workload - small reads
easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h{noformat}
It also appears that requests don't time out within the configured server-side timeouts:

 
{noformat}
                 Writes                                  Reads                                  Deletes                       Errors
  Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  Latency (p99)  1min (req/s) |   Count  1min (errors/s)
 950286       70403.93        634.77 |  789524       70442.07        426.02 |       0              0             0 | 9580484         18980.45
 952304       70567.62         640.1 |  791072       70634.34        428.36 |       0              0             0 | 9636658         18969.54
 953146       70767.34         640.1 |  791400       70767.76        428.36 |       0              0             0 | 9695272         18969.54
 956833       71171.28        623.14 |  794009        71175.6        412.79 |       0              0             0 | 9749377         19002.44
 959627       71312.58        656.93 |  795703       71349.87        435.56 |       0              0             0 | 9804907         18943.11{noformat}
 

After stopping the load test altogether, it took nearly a minute before the 
requests were no longer queued.
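
For illustration only, here is a minimal sketch of the two behaviours this ticket asks for: a bounded queue that rejects new work once it is full, and a deadline check at execution time so a request that has already overstayed the server-side timeout is dropped instead of being processed.  This is a plain JDK executor, not Cassandra's actual native transport stage; the class name, queue size, thread counts, and timeout below are made up.
{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedRequestExecutor
{
    // Hypothetical limits for illustration; not Cassandra configuration values.
    private static final int MAX_QUEUED_REQUESTS = 1024;
    private static final long REQUEST_TIMEOUT_NANOS = TimeUnit.SECONDS.toNanos(5);

    private final ThreadPoolExecutor executor = new ThreadPoolExecutor(
        8, 8, 60, TimeUnit.SECONDS,
        // Bounded queue: once it is full, new work is rejected instead of piling up.
        new ArrayBlockingQueue<>(MAX_QUEUED_REQUESTS),
        // Shed load at the door: the caller gets an immediate "overloaded" error.
        (task, pool) -> { throw new RejectedExecutionException("queue full, shedding request"); });

    public void submit(Runnable handleRequest)
    {
        long enqueuedAt = System.nanoTime();
        executor.execute(() -> {
            // Enforce the timeout on dequeue as well: if the request already sat in the
            // queue past its deadline, drop it rather than doing work the client gave up on.
            if (System.nanoTime() - enqueuedAt > REQUEST_TIMEOUT_NANOS)
                return; // would respond with a server-side timeout here
            handleRequest.run();
        });
    }
}
{code}
With a bound like this the node would return an overloaded error within milliseconds instead of letting hundreds of thousands of requests accumulate, and the dequeue-time check would keep already-expired requests from consuming resources after the client has given up.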


> unbounded queues in native transport requests lead to node instability
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-19534
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19534
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jon Haddad
>            Priority: Normal
>


