[ https://issues.apache.org/jira/browse/CASSANDRA-13009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832727#comment-15832727 ]
Simon Zhou commented on CASSANDRA-13009: ---------------------------------------- [~tjake], do you mind reviewing this patch? Thanks. > Speculative retry bugs > ---------------------- > > Key: CASSANDRA-13009 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13009 > Project: Cassandra > Issue Type: Bug > Reporter: Simon Zhou > Assignee: Simon Zhou > Fix For: 3.0.11 > > Attachments: CASSANDRA-13009-v1.patch > > > There are a few issues with speculative retry: > 1. Time unit bugs. These are from ColumnFamilyStore (v3.0.10): > The left hand side is in nanos, as the name suggests, while the right hand > side is in millis. > {code} > sampleLatencyNanos = DatabaseDescriptor.getReadRpcTimeout() / 2; > {code} > Here coordinatorReadLatency is already in nanos and we shouldn't multiple the > value by 1000. This was a regression in 8896a70 when we switch metrics > library and the two libraries use different time units. > {code} > sampleLatencyNanos = (long) > (metric.coordinatorReadLatency.getSnapshot().getValue(retryPolicy.threshold()) > * 1000d); > {code} > 2. Confusing overload protection and retry delay. As the name > "sampleLatencyNanos" suggests, it should be used to keep the actually sampled > read latency. However, we assign it the retry threshold in the case of > CUSTOM. Then we compare the retry threshold with read timeout (defaults to > 5000ms). This means, if we use speculative_retry=10ms for the table, we won't > be able to avoid being overloaded. We should compare the actual read latency > with the read timeout for overload protection. See line 450 of > ColumnFamilyStore.java and line 279 of AbstractReadExecutor.java. > My proposals are: > a. We use sampled p99 delay and compare it with a customizable threshold > (-Dcassandra.overload.threshold) for overload detection. > b. Introduce another variable retryDelayNanos for waiting time before retry. > This is the value from table setting (PERCENTILE or CUSTOM). -- This message was sent by Atlassian JIRA (v6.3.4#6332)