[ 
https://issues.apache.org/jira/browse/CASSANDRA-13009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

T Jake Luciani updated CASSANDRA-13009:
---------------------------------------
    Fix Version/s: 2.2.9

> Speculative retry bugs
> ----------------------
>
>                 Key: CASSANDRA-13009
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13009
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Simon Zhou
>            Assignee: Simon Zhou
>             Fix For: 2.2.9, 3.0.11
>
>         Attachments: CASSANDRA-13009-v1.patch, CASSANDRA-13009-v2.patch
>
>
> There are a few issues with speculative retry:
> 1. Time unit bugs. These are from ColumnFamilyStore (v3.0.10):
> The left hand side is in nanos, as the name suggests, while the right hand 
> side is in millis.
> {code}
> sampleLatencyNanos = DatabaseDescriptor.getReadRpcTimeout() / 2;
> {code}
> Here coordinatorReadLatency is already in nanos and we shouldn't multiple the 
> value by 1000. This was a regression in 8896a70 when we switch metrics 
> library and the two libraries use different time units.
> {code}
> sampleLatencyNanos = (long) 
> (metric.coordinatorReadLatency.getSnapshot().getValue(retryPolicy.threshold())
>  * 1000d);
> {code}
> 2. Confusing overload protection and retry delay. As the name 
> "sampleLatencyNanos" suggests, it should be used to keep the actually sampled 
> read latency. However, we assign it the retry threshold in the case of 
> CUSTOM. Then we compare the retry threshold with read timeout (defaults to 
> 5000ms). This means, if we use speculative_retry=10ms for the table, we won't 
> be able to avoid being overloaded. We should compare the actual read latency 
> with the read timeout for overload protection. See line 450 of 
> ColumnFamilyStore.java and line 279 of AbstractReadExecutor.java.
> My proposals are:
> a. We use sampled p99 delay and compare it with a customizable threshold 
> (-Dcassandra.overload.threshold) for overload detection.
> b. Introduce another variable retryDelayNanos for waiting time before retry. 
> This is the value from table setting (PERCENTILE or CUSTOM).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to