[ https://issues.apache.org/jira/browse/CASSANDRA-19330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jakub Zytka updated CASSANDRA-19330:
------------------------------------
    Source Control Link: https://github.com/apache/cassandra/pull/3107

> speculative read replies may be created too eagerly and swamp the cluster
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19330
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19330
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Legacy/Coordination
>            Reporter: Jakub Zytka
>            Assignee: Jakub Zytka
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> This problem affects at least 4.x and newer versions, and possibly older 
> ones as well.
> The patch proposed for CASSANDRA-19215 
> ([https://github.com/apache/cassandra/pull/3064/files]) makes it trivial to 
> hit on trunk.
> The description below uses the `trunk` code, but the same problems occur in 
> earlier versions, even if, e.g., some class names differ.
> In one of our tests with RF=3 and QUORUM reads, we observed that the number 
> of speculative read retries was of the same order of magnitude as the number 
> of client reads.
> In effect, this reduced the cluster throughput by roughly 1/3.
> The speculative read retries mechanism and `PercentileSpeculativeRetryPolicy` 
> in particular suffer from several problems:
> 1. triggering a speculative retry on the basis of a comparison of two 
> distinct intervals
> The speculative retry is triggered on the basis of:
>  - `CFS::sampleReadLatency*` field, which is sourced from the 
> `coordinatorReadLatency` metric
>  - `queryStartNanoTime` (`ReadCallback::await` called by 
> `AbstractReadExecutor::shouldSpeculateAndMaybeWait`)
> `coordinatorReadLatency` measures (among other things) the duration of the 
> `StorageProxy::readRegular` function.
> The `queryStartNanoTime` measurement happens earlier; with CASSANDRA-19215 
> (and `cql_start_time = QUEUE`) the measurement happens much earlier.
> The offset between these two intervals is not accounted for in any way; the 
> `AbstractReadExecutor::shouldSpeculateAndMaybeWait` / `ReadCallback::await` 
> pair treats them as measurements of the same interval.
> Taken together, this means that saturating the server with reads, which 
> causes a buildup of the `Native-Transport-Requests` queue, will inevitably 
> trigger needless speculative read retries, because `sampleReadLatency*` does 
> not account for the queue wait time.
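> To make the mismatch concrete, here is a minimal standalone sketch (not 
> Cassandra code; the figures and names are made up for illustration) of how 
> the two intervals end up being compared on a saturated node:
> ```java
> // Illustrative sketch of the interval mismatch described above; all figures
> // and names are hypothetical, none of this is actual Cassandra code.
> public class SpeculationMismatchSketch
> {
>     public static void main(String[] args)
>     {
>         // Assumed figures for a saturated node:
>         long queueWaitNanos    = 8_000_000L; // time spent in the Native-Transport-Requests queue
>         long coordinationNanos = 2_000_000L; // time spent inside StorageProxy::readRegular
>
>         // sampleReadLatency* is fed from coordinatorReadLatency, which only
>         // covers the readRegular part, so the threshold converges to ~2-3ms.
>         long speculationThresholdNanos = 2_500_000L;
>
>         // ReadCallback::await, however, measures elapsed time from
>         // queryStartNanoTime, which was taken before the queue wait.
>         long elapsedSinceQueryStartNanos = queueWaitNanos + coordinationNanos;
>
>         // The comparison mixes the two intervals: the elapsed time includes
>         // the queue wait, the threshold does not, so under saturation this
>         // fires for virtually every read even when all replicas respond
>         // promptly.
>         boolean speculate = elapsedSinceQueryStartNanos > speculationThresholdNanos;
>         System.out.println("speculative retry triggered: " + speculate); // true
>     }
> }
> ```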
> 2. triggering a speculative retry based on time spent coordinating
> I assume that speculative retries aim to reduce latency/prevent timeout when 
> some replicas that are contacted first (in 
> `AbstractReadExecutor::executeAsync`) do not respond in a timely manner, for 
> whatever reason.
> I understand the desire to limit the user-observable latency and thus base 
> the retry decision on the closest approximation of the user-observable 
> latency we have (`coordinatorReadLatency`). Still, I believe it is a flawed 
> approach.
> If the coordination part is fast, and some replicas are late, it doesn't 
> really matter if we include coordination time.
> If, on the other hand, for whatever reason, the coordination part gets slow, 
> triggering additional speculative retries will only make matters worse.
> 3. the `coordinatorReadLatency` metric is counter-intuitive
> I understand this point is subjective. That said, I can't find a reason why 
> the `coordinatorReadLatency`
> should not cover the longest interval feasible, i.e. from 
> `queryStartNanoTime` till the end of execution,
> instead of a specific portion of the code (which, admittedly, does most of 
> the read-related work).
> This is another facet of point 1 - not only do we measure two different 
> intervals, but both seem to be measured incorrectly.
> 4. the metric values used to update the `sampleReadLatency*` field are sticky 
> to histogram bucket ends
> This is more something to be aware of rather than a particular problem to 
> solve. Still, I felt it is worth mentioning because perhaps there's some 
> improvement to be made that I'm not aware of.
> We are using the `coordinatorReadLatency` metric as the source for the 
> `sampleReadLatency*` field.
> Let's assume we're using the 99th percentile of that metric. At its heart, the 
> metric is powered by a `DecayingEstimatedHistogramReservoir`. The actual 
> percentile value is computed in `AbstractSnapshot::getValue`.
> The value returned by this function is always a bucket boundary 
> (`bucketOffsets[i]`).
> For example, if we have bucket boundaries ..., 20.5ms, 24ms, ..., and we 
> update the Timer only with measurements in the 21-23ms range, then *ALL* 
> percentile values will equal 24ms.
> This is different from other systems. In Prometheus, for example, linear 
> interpolation between bucket ends is performed to produce a finer-grained 
> quantile estimate (see the sketch at the end of this point).
> The bucket boundaries grow by ~20% each, so at the level of 20ms a bucket 
> covers a timespan of roughly 4ms; at the level of 100ms, roughly 20ms.
> Fortunately, the percentile value we get is the *upper* end of the bucket, 
> making the speculative retries more difficult to trigger.
> One additional note is that the stickiness of the percentiles stabilizes the 
> behaviour of percentile-based decision makers, e.g. the 
> `DynamicEndpointSnitch`. Implementing linear interpolation in 
> `AbstractSnapshot::getValue` may unexpectedly influence such components.
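> To illustrate the snapping, here is a self-contained sketch (not the actual 
> `DecayingEstimatedHistogramReservoir`/`AbstractSnapshot` code; the bucket 
> boundaries and helper names are invented) that records samples only in the 
> 21-23ms range and compares the bucket-end percentile with a Prometheus-style 
> interpolated one:
> ```java
> import java.util.Arrays;
>
> public class BucketSnappingSketch
> {
>     // Invented bucket boundaries growing by roughly 20% each, similar in
>     // spirit (but not identical) to the real reservoir's offsets.
>     static final double[] OFFSETS_MS = { 14.2, 17.1, 20.5, 24.6, 29.5, 35.4 };
>
>     public static void main(String[] args)
>     {
>         // Record measurements only in the 21-23ms range, i.e. in the single
>         // bucket whose upper boundary is 24.6ms.
>         long[] counts = new long[OFFSETS_MS.length];
>         for (int tenth = 210; tenth <= 230; tenth++)
>         {
>             double sampleMs = tenth / 10.0;
>             int bucket = 0;
>             while (OFFSETS_MS[bucket] < sampleMs)
>                 bucket++;
>             counts[bucket]++;
>         }
>
>         // With the "return the bucket boundary" rule, every percentile snaps
>         // to 24.6ms, regardless of where the samples sit inside the bucket.
>         System.out.printf("p50 (bucket end)   = %.1f ms%n", snapped(counts, 0.50));
>         System.out.printf("p99 (bucket end)   = %.1f ms%n", snapped(counts, 0.99));
>
>         // Linear interpolation inside the bucket (Prometheus-style) gives a
>         // finer-grained estimate for percentiles that fall mid-bucket.
>         System.out.printf("p50 (interpolated) = %.1f ms%n", interpolated(counts, 0.50));
>     }
>
>     static double snapped(long[] counts, double quantile)
>     {
>         long rank = rankOf(counts, quantile), seen = 0;
>         for (int i = 0; i < counts.length; i++)
>             if ((seen += counts[i]) >= rank)
>                 return OFFSETS_MS[i]; // always a bucket boundary
>         return Double.NaN;
>     }
>
>     static double interpolated(long[] counts, double quantile)
>     {
>         long rank = rankOf(counts, quantile), seen = 0;
>         for (int i = 0; i < counts.length; i++)
>         {
>             long before = seen;
>             if ((seen += counts[i]) >= rank)
>             {
>                 double lower = i == 0 ? 0.0 : OFFSETS_MS[i - 1];
>                 double fraction = (rank - before) / (double) counts[i];
>                 return lower + fraction * (OFFSETS_MS[i] - lower);
>             }
>         }
>         return Double.NaN;
>     }
>
>     static long rankOf(long[] counts, double quantile)
>     {
>         return (long) Math.ceil(quantile * Arrays.stream(counts).sum());
>     }
> }
> ```
> With these invented samples, both the 50th and 99th percentiles report 24.6ms 
> under the bucket-end rule, while the interpolated p50 lands around 22.6ms.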
> I will follow up with specific improvement proposals.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
