[ 
https://issues.apache.org/jira/browse/CASSANDRA-19330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jakub Zytka updated CASSANDRA-19330:
------------------------------------
    Description: 
This problem affects at least 4.x and newer versions, and possibly older ones.
The patch proposed in CASSANDRA-19215 
([https://github.com/apache/cassandra/pull/3064/files]) makes it trivial to hit 
on trunk.
The description below refers to the `trunk` code, but the same problems occur in 
earlier versions, even though, e.g., class names may differ.

In one of our tests with RF=3 and QUORUM reads, we observed that the number of 
speculative read retries was of the same order of magnitude as the number of 
client reads.

In effect, this reduced the cluster throughput by roughly 1/3.

The speculative read retries mechanism and `PercentileSpeculativeRetryPolicy` 
in particular suffer from several problems:

1. triggering a speculative retry on the basis of a comparison of two distinct 
intervals

The speculative retry is triggered on the basis of:
 - `CFS::sampleReadLatency*` field, which is sourced from the 
`coordinatorReadLatency` metric
 - `queryStartNanoTime` (`ReadCallback::await` called by 
`AbstractReadExecutor::shouldSpeculateAndMaybeWait`)

`coordinatorReadLatency` measures, among other things, the duration of the 
`StorageProxy::readRegular` function.
The `queryStartNanoTime` measurement happens earlier; with CASSANDRA-19215 (and 
`cql_start_time = QUEUE`) it happens much earlier.

The offset between these two measurement start points is not accounted for in any 
way, and the `AbstractReadExecutor::shouldSpeculateAndMaybeWait` / 
`ReadCallback::await` pair treats the two as measurements of the same interval.

Taken together, this means that saturating the server with reads, which causes a 
buildup of the `Native-Transport-Requests` queue, will inevitably trigger needless 
speculative read retries, because `sampleReadLatency` does not account for the 
queue wait time.
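
To make the mismatch concrete, here is a minimal, self-contained sketch (plain 
Java; the numbers and the class/variable names other than those cited above are 
made up for illustration and are not the actual Cassandra code):

```java
// Sketch of point 1: the elapsed time measured from queryStartNanoTime includes
// the Native-Transport-Requests queue wait, while sampleReadLatency is derived
// from coordinatorReadLatency, which only covers StorageProxy::readRegular.
public class IntervalMismatchSketch
{
    public static void main(String[] args)
    {
        long queueWaitNanos = 30_000_000L;         // 30 ms spent in the NTR queue (hypothetical)
        long coordinationNanos = 5_000_000L;       // 5 ms spent inside readRegular so far (hypothetical)

        // sampleReadLatency* is a percentile of coordinatorReadLatency, so it
        // knows nothing about the queue wait.
        long sampleReadLatencyNanos = 10_000_000L; // e.g. p99 of coordinatorReadLatency = 10 ms

        // queryStartNanoTime is captured before the queue wait, so the elapsed
        // time that ReadCallback::await compares against sampleReadLatency
        // already includes that wait.
        long elapsedSinceQueryStartNanos = queueWaitNanos + coordinationNanos;

        boolean speculate = elapsedSinceQueryStartNanos > sampleReadLatencyNanos;
        // Prints "speculative retry triggered: true" even though the replicas
        // have had only 5 ms to respond; the retry is caused purely by queueing.
        System.out.println("speculative retry triggered: " + speculate);
    }
}
```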

2. triggering a speculative retry based on time spent coordinating

I assume that speculative retries aim to reduce latency and prevent timeouts when 
some of the replicas contacted first (in `AbstractReadExecutor::executeAsync`) do 
not respond in time, for whatever reason.
I understand the desire to limit the user-observable latency and thus base the 
retry decision on the closest approximation of the user-observable latency we 
have (`coordinatorReadLatency`). Still, I believe this is a flawed approach.
If the coordination part is fast and some replicas are late, it doesn't really 
matter whether we include the coordination time.
If, on the other hand, the coordination part gets slow for whatever reason, 
triggering additional speculative retries will only make matters worse.

3. the `coordinatorReadLatency` metric is counter-intuitive

I understand this point is subjective. That said, I can't find a reason why 
`coordinatorReadLatency` should not cover the longest feasible interval, i.e. from 
`queryStartNanoTime` until the end of execution, instead of only a specific 
portion of the code (which, admittedly, does most of the read-related work).

This is another facet of point 1 - not only do we measure two different 
intervals, but both seem to be measured incorrectly.

4. the metric values used to update the `sampleReadLatency*` field are sticky to 
histogram bucket ends

This is more something to be aware of than a particular problem to solve. Still, 
I feel it is worth mentioning because perhaps there is some improvement to be 
made that I'm not aware of.

We use the `coordinatorReadLatency` metric as the source for the 
`sampleReadLatency*` field.
Let's assume we use the 99th percentile of that metric. At its heart, the metric 
is powered by a `DecayingEstimatedHistogramReservoir`, and the actual percentile 
value is computed in `AbstractSnapshot::getValue`.
The value returned by this function is always a bucket boundary 
(`bucketOffsets[i]`).
For example, if we have bucket boundaries ..., 20.5ms, 24ms, ..., and we update 
the Timer only with measurements in the 21-23ms range, then *ALL* percentile 
values will equal 24ms.

This differs from other systems. In Prometheus, for example, linear interpolation 
between the bucket ends is performed to estimate a more fine-grained value for a 
requested quantile.
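
To illustrate the difference, here is a small self-contained sketch (plain Java; 
the bucket boundaries and counts are invented, and this is neither the actual 
`DecayingEstimatedHistogramReservoir` nor the Prometheus code) contrasting the 
bucket-end estimate with a linearly interpolated one:

```java
// Sketch of point 4: a bucket-boundary percentile estimate vs. a Prometheus-style
// linear interpolation inside the bucket that the requested rank falls into.
public class PercentileEstimationSketch
{
    public static void main(String[] args)
    {
        double lowerBoundMs = 20.5;  // lower end of the bucket the rank falls into
        double upperBoundMs = 24.0;  // upper end of that bucket
        long countBelowBucket = 90;  // samples recorded in all lower buckets
        long countInBucket = 10;     // samples recorded in the 20.5-24 ms bucket

        double percentile = 0.95;
        long total = countBelowBucket + countInBucket;
        long rank = (long) Math.ceil(percentile * total); // the 95th sample overall

        // Bucket-end style (as described above for AbstractSnapshot::getValue):
        // any percentile that lands in this bucket reports its upper boundary.
        double bucketEndEstimateMs = upperBoundMs; // 24 ms

        // Prometheus-style: interpolate linearly between the bucket ends
        // according to where the rank sits inside the bucket.
        double positionInBucket = (rank - countBelowBucket) / (double) countInBucket; // 0.5
        double interpolatedEstimateMs = lowerBoundMs + positionInBucket * (upperBoundMs - lowerBoundMs); // 22.25 ms

        System.out.printf("bucket-end estimate: %.2f ms, interpolated estimate: %.2f ms%n",
                          bucketEndEstimateMs, interpolatedEstimateMs);
    }
}
```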

The bucket boundaries grow by ~20% each, so around 20ms a bucket covers roughly a 
4ms timespan, and around 100ms it covers roughly a 20ms timespan.
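
As a quick sanity check on those numbers, a tiny sketch (assuming the ~1.2x 
geometric growth described above; the exact Cassandra bucket offsets may differ 
slightly due to rounding):

```java
// Width of a bucket under ~20% geometric growth of the boundaries.
public class BucketWidthSketch
{
    public static void main(String[] args)
    {
        double growthFactor = 1.2; // ~20% per bucket, per the description above
        for (double lowerMs : new double[] { 20.0, 100.0 })
        {
            double upperMs = lowerMs * growthFactor;
            // around 20 ms: ~4 ms wide; around 100 ms: ~20 ms wide
            System.out.printf("a bucket starting near %.0f ms spans about %.0f ms%n",
                              lowerMs, upperMs - lowerMs);
        }
    }
}
```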

Fortunately, the percentile value we get is the *upper* end of the bucket, which 
makes speculative retries more difficult to trigger.

One additional note: the stickiness of the percentiles stabilizes the behaviour 
of percentile-based decision makers, e.g. the `DynamicEndpointSnitch`. 
Implementing linear interpolation in `AbstractSnapshot::getValue` may therefore 
unexpectedly influence such components.

I will follow up with specific improvement proposals.


> speculative read replies may be created too eagerly and swamp the cluster
> -------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19330
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19330
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jakub Zytka
>            Priority: Normal
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

