[ https://issues.apache.org/jira/browse/CASSANDRA-19330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jakub Zytka updated CASSANDRA-19330:
------------------------------------
    Bug Category: Parent values: Degradation(12984), Level 1 values: Performance Bug/Regression(12997)
      Complexity: Normal
     Component/s: Legacy/Coordination
   Discovered By: Adhoc Test
        Severity: Normal
        Assignee: Jakub Zytka
          Status: Open  (was: Triage Needed)

speculative read replies may be created too eagerly and swamp the cluster
--------------------------------------------------------------------------

             Key: CASSANDRA-19330
             URL: https://issues.apache.org/jira/browse/CASSANDRA-19330
         Project: Cassandra
      Issue Type: Bug
      Components: Legacy/Coordination
        Reporter: Jakub Zytka
        Assignee: Jakub Zytka
        Priority: Normal
         Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x

This problem affects at least 4.x, newer, and possibly older versions. The patch proposed in CASSANDRA-19215 ([https://github.com/apache/cassandra/pull/3064/files]) makes it trivial to hit on trunk.

The description below uses the `trunk` code, but the same problems occur in earlier versions, even if, e.g., class names have changed.

In one of our tests with RF=3 and QUORUM reads, we observed that the number of speculative read retries was of the same order of magnitude as the number of client reads. In effect, this reduced the cluster throughput by roughly 1/3.

The speculative read retry mechanism, and `PercentileSpeculativeRetryPolicy` in particular, suffers from several problems:

1. Triggering a speculative retry based on a comparison of two distinct intervals

The speculative retry is triggered on the basis of:
- the `CFS::sampleReadLatency*` field, which is sourced from the `coordinatorReadLatency` metric
- `queryStartNanoTime` (`ReadCallback::await`, called by `AbstractReadExecutor::shouldSpeculateAndMaybeWait`)

`coordinatorReadLatency` measures (among other things) the duration of the `StorageProxy::readRegular` function. The `queryStartNanoTime` measurement happens earlier; with CASSANDRA-19215 (and `cql_start_time = QUEUE`) it happens much earlier.

The difference between the start points of these two intervals is not accounted for in any way, and the `AbstractReadExecutor::shouldSpeculateAndMaybeWait` / `ReadCallback::await` duo treats them as measurements of the same interval.

Taken together, this means that saturating the server with reads, which causes a buildup of the `Native-Transport-Requests` queue, will inevitably trigger needless speculative read retries, because `sampleReadLatency` does not account for the queue wait time. (A minimal sketch of this mismatch follows below.)
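To make the mismatch concrete, here is a minimal, self-contained sketch. This is plain Java, not the actual Cassandra code path; the class name, structure, and all timing values are hypothetical. It only illustrates how a wait budget derived from `coordinatorReadLatency` can be exhausted by queue wait time that the metric never observed:

    // Illustrative sketch only (not Cassandra code): a retry threshold derived from
    // coordinatorReadLatency (which excludes NTR queue wait) is compared against the
    // time elapsed since queryStartNanoTime (which includes it). All numbers are
    // hypothetical.
    public final class IntervalMismatchSketch
    {
        private static long millis(long ms)
        {
            return ms * 1_000_000L;
        }

        public static void main(String[] args)
        {
            // Threshold sampled from coordinatorReadLatency: p99 of the time spent
            // inside StorageProxy::readRegular only, no queue wait included.
            long sampleReadLatencyNanos = millis(20);

            // Under saturation the request sat 30 ms in the Native-Transport-Requests
            // queue before coordination even started; the replica itself is healthy
            // and would reply within 15 ms.
            long queueWaitNanos = millis(30);
            long replicaReplyNanos = millis(15);

            // The wait budget is measured from queryStartNanoTime, so the queue wait
            // alone has already consumed it.
            long elapsedSinceQueryStart = queueWaitNanos;
            long remainingBudgetNanos = sampleReadLatencyNanos - elapsedSinceQueryStart;

            boolean speculate = remainingBudgetNanos <= 0 || replicaReplyNanos > remainingBudgetNanos;
            System.out.printf("remaining budget: %d ms, needless speculative retry: %b%n",
                              remainingBudgetNanos / 1_000_000, speculate);
        }
    }

With these numbers the budget is already negative when coordination starts, so a retry fires even though the replica would have answered well within 15 ms.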
2. Triggering a speculative retry based on time spent coordinating

I assume that speculative retries aim to reduce latency / prevent timeouts when some of the replicas contacted first (in `AbstractReadExecutor::executeAsync`) do not respond in time, for whatever reason.

I understand the desire to limit the user-observable latency and thus base the retry decision on the closest approximation of the user-observable latency we have (`coordinatorReadLatency`). Still, I believe it is a flawed approach. If the coordination part is fast and some replicas are late, it doesn't really matter whether we include coordination time. If, on the other hand, the coordination part gets slow for whatever reason, triggering additional speculative retries will only make matters worse.

3. The `coordinatorReadLatency` metric is counter-intuitive

I understand this point is subjective. That said, I can't find a reason why `coordinatorReadLatency` should not cover the longest interval feasible, i.e. from `queryStartNanoTime` until the end of execution, instead of a specific portion of the code (which, admittedly, does most of the read-related work). This is another facet of point 1: not only do we measure two different intervals, but both seem to be measured incorrectly.

4. The metric values used to update the `sampleReadLatency*` field are sticky to histogram bucket ends

This is more something to be aware of than a particular problem to solve. Still, I felt it is worth mentioning because perhaps there is some improvement to be made that I'm not aware of.

We use the `coordinatorReadLatency` metric as the source for the `sampleReadLatency*` field. Let's assume we use the 99th percentile of that metric. At its heart, the metric is powered by a `DecayingEstimatedHistogramReservoir`. The actual percentile value is computed in `AbstractSnapshot::getValue`, and the value returned by this function is always a bucket boundary (`bucketOffsets[i]`). For example, if we have bucket boundaries ..., 20.5 ms, 24 ms, ... and we update the Timer only with measurements in the 21-23 ms range, then *ALL* percentile values will equal 24 ms. (A standalone sketch of this effect follows at the end of this description.)

This differs from other systems. E.g. in Prometheus, a linear interpolation between bucket ends is performed to estimate a more fine-grained value of a particular sample.

The bucket boundaries grow by ~20% each, so at the level of 20 ms a bucket covers roughly a 4 ms timespan; at the level of 100 ms it covers roughly a 20 ms timespan. Fortunately, the percentile value we get is the *upper* end of the bucket, which makes speculative retries harder to trigger.

One additional note: the stickiness of the percentiles stabilizes the behaviour of percentile-based decision makers, e.g. the `DynamicEndpointSnitch`. Implementing linear interpolation in `AbstractSnapshot::getValue` may unexpectedly influence such components.

I will follow up with specific improvement proposals.
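For reference, here is a minimal standalone sketch of the bucket-end stickiness described in point 4. This is plain Java, not the actual `DecayingEstimatedHistogramReservoir` / `AbstractSnapshot` code; the bucket boundaries, sample values, and helper names are hypothetical. It reproduces the 20.5 ms / 24 ms example above and contrasts it with a Prometheus-style interpolated estimate:

    // Illustrative sketch only: with exponentially growing bucket boundaries, every
    // percentile reported as a bucket upper bound is "sticky" to that bound, unlike
    // a linearly interpolated estimate. Boundaries and samples are hypothetical.
    import java.util.Arrays;

    public final class BucketStickinessSketch
    {
        // Boundaries growing by roughly 20% each, values in ms.
        static final double[] BUCKETS = { 14.2, 17.1, 20.5, 24.0, 29.5, 35.4 };

        public static void main(String[] args)
        {
            long[] counts = new long[BUCKETS.length];
            // Every measurement lands between 21 ms and 23 ms, i.e. in the (20.5, 24.0] bucket.
            for (double sample : new double[] { 21.0, 21.5, 22.0, 22.5, 23.0 })
            {
                int i = 0;
                while (i < BUCKETS.length - 1 && sample > BUCKETS[i])
                    i++;
                counts[i]++;
            }

            System.out.println("p50 (bucket end):   " + bucketEndPercentile(counts, 0.50) + " ms");
            System.out.println("p99 (bucket end):   " + bucketEndPercentile(counts, 0.99) + " ms");
            System.out.println("p50 (interpolated): " + interpolatedPercentile(counts, 0.50) + " ms");
        }

        // Returns the upper boundary of the bucket containing the requested quantile,
        // mirroring the bucket-end behaviour described for AbstractSnapshot::getValue.
        static double bucketEndPercentile(long[] counts, double q)
        {
            long total = Arrays.stream(counts).sum();
            long rank = (long) Math.ceil(q * total);
            long seen = 0;
            for (int i = 0; i < counts.length; i++)
            {
                seen += counts[i];
                if (seen >= rank)
                    return BUCKETS[i];
            }
            return BUCKETS[BUCKETS.length - 1];
        }

        // Prometheus-style estimate: interpolate linearly inside the target bucket.
        static double interpolatedPercentile(long[] counts, double q)
        {
            long total = Arrays.stream(counts).sum();
            double rank = q * total;
            long seen = 0;
            for (int i = 0; i < counts.length; i++)
            {
                if (seen + counts[i] >= rank)
                {
                    double lower = i == 0 ? 0 : BUCKETS[i - 1];
                    double fraction = counts[i] == 0 ? 0 : (rank - seen) / counts[i];
                    return lower + (BUCKETS[i] - lower) * fraction;
                }
                seen += counts[i];
            }
            return BUCKETS[BUCKETS.length - 1];
        }
    }

With every sample between 21 ms and 23 ms, the bucket-end percentile prints 24.0 for both p50 and p99, while the interpolated estimate lands inside the bucket (22.25 ms for p50).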