[
https://issues.apache.org/jira/browse/FLINK-39306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dennis-Mircea Ciupitu updated FLINK-39306:
------------------------------------------
Description:
h1. Summary
The busy-time {{TRUE_PROCESSING_RATE}} is computed as a ratio, {{busyTimeTpr =
inputRate / busyTime}}, but the numerator and denominator are estimated
over the metric window with different methods. The denominator is computed
consistently with the configured busy-time aggregator, while the numerator
always uses {{getRate}} on the cumulative records counter. Under the default
{{MAX}} aggregator the two halves of the ratio therefore use different temporal
estimators, which is internally inconsistent and can skew the ratio under
non-uniform metric sampling. This issue aligns the numerator estimator with the
denominator so the ratio is internally consistent.
h1. Background
{{TRUE_PROCESSING_RATE}} (the vertex capacity used by the scaler) is the larger
of two sub-paths, a busy-time based estimate and an observed estimate, selected
by {{selectTprMetric}}.
The busy-time estimate is {{busyTimeTpr = inputRate / (busyTimeAvg / 1000)}}.
The denominator {{busyTimeAvg}} depends on the
{{kubernetes.operator.metrics.busy-time.aggregator}} option:
* {{AVG}}: {{getRate(ACCUMULATED_BUSY_TIME) / parallelism}}, a cumulative
(time-integral) rate.
* {{MAX}} or {{MIN}} (default {{MAX}}): {{getAverage(LOAD) * 1000}}, an
arithmetic mean of the per-second busy-time gauge samples.

The numerator {{inputRate}} is always {{getRate(NUM_RECORDS_IN)}}, a cumulative
endpoint rate.
{{getRate}} (a time-weighted, time-integral average) and {{getAverage}} of a
per-second gauge (an unweighted sample mean) are two different linear
estimators. They agree under uniform sampling but diverge under non-uniform
sampling (bursts, recovery, transients).
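
As an illustration of the difference between the two estimators, here is a minimal Python sketch. It is a toy model and not the operator's code: the functions {{time_weighted_rate}} and {{sample_mean}} and the sample windows are hypothetical stand-ins for {{getRate}} and {{getAverage}}, and each sample is assumed to report the rate observed over the interval that ends at it.

```python
from statistics import fmean

# Each entry is (interval duration in seconds, records/second during that interval).
# Hypothetical data: both windows carry the same gauge readings, only the spacing differs.
uniform = [(10.0, 100.0), (10.0, 300.0), (10.0, 200.0)]
skewed = [(10.0, 100.0), (50.0, 300.0), (10.0, 200.0)]


def time_weighted_rate(intervals):
    """Cumulative-counter style estimate: total records divided by total elapsed time."""
    total_records = sum(rate * duration for duration, rate in intervals)
    total_time = sum(duration for duration, _ in intervals)
    return total_records / total_time


def sample_mean(intervals):
    """Gauge style estimate: unweighted mean of the per-second readings."""
    return fmean([rate for _, rate in intervals])


for name, window in (("uniform", uniform), ("skewed", skewed)):
    print(name, time_weighted_rate(window), sample_mean(window))
```

With evenly spaced samples the two functions return the same value. In the skewed window the long middle interval dominates the time-weighted estimate, while the sample mean still counts every reading equally.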
h1. Problem
Under the default {{MAX}} aggregator, {{busyTimeTpr}}'s denominator is a
per-second sample mean while its numerator is a cumulative endpoint rate.
Because the value is a ratio of two co-varying quantities, using a single
shared estimator for both lets their common sampling weighting cancel, whereas
mixing estimators leaves the denominator's sampling artifact in the result. The
numerator is also the odd one out relative to the observed estimate it is
compared against in {{selectTprMetric}}, which is built entirely from
per-second gauges.
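
The cancellation argument can be checked numerically with a small Python sketch. It rests on stated assumptions that are not taken from the operator: a hypothetical constant capacity ({{CAPACITY}}), an input rate that is exactly proportional to busy time in every interval, and made-up window data with one long interval.

```python
from statistics import fmean

CAPACITY = 1000.0  # hypothetical true capacity, in records per busy second

# Each entry is (interval duration in seconds, busy time in ms per second).
# Hypothetical, non-uniformly sampled window.
window = [(10.0, 200.0), (50.0, 800.0), (10.0, 400.0)]

durations = [duration for duration, _ in window]
busy_ms = [busy for _, busy in window]
# Assumption: the input rate co-varies perfectly with busy time.
rates = [CAPACITY * busy / 1000.0 for busy in busy_ms]


def time_weighted(values, weights):
    """Time-integral average, standing in for a cumulative-counter rate."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Same estimator on both halves: the sampling weighting cancels.
matched_gauge = fmean(rates) / (fmean(busy_ms) / 1000.0)
matched_cumulative = time_weighted(rates, durations) / (
    time_weighted(busy_ms, durations) / 1000.0
)

# Mixed estimators: time-weighted numerator over an unweighted-mean denominator.
mixed = time_weighted(rates, durations) / (fmean(busy_ms) / 1000.0)

print(matched_gauge, matched_cumulative, mixed)
```

Under these assumptions both matched ratios recover the assumed capacity, while the mixed ratio overshoots it because the long, busy interval inflates only the numerator.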
h1. Goal
Make the busy-time {{TRUE_PROCESSING_RATE}} numerator follow the same estimator
as its busy-time denominator: per-second gauge mean under {{MAX}} or
{{MIN}}, cumulative {{getRate}} under {{AVG}}. This is an
internal-consistency fix for the ratio. It is scoped to that numerator only. It
is not a change to the demand or edge data-rate paths, and it does not attempt
to revalidate the underlying capacity model, the observed-rate formula, or the
subtask aggregation, which are out of scope.
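
The intended selection rule can be sketched in Python as follows. This is a hedged illustration and not the operator's Java implementation: {{BusyTimeAggregator}}, {{busy_time_tpr}}, and the parameter names are hypothetical, and the two candidate input rates are assumed to be computed elsewhere.

```python
from enum import Enum


class BusyTimeAggregator(Enum):
    """Hypothetical mirror of the busy-time aggregator option values."""
    AVG = "AVG"
    MAX = "MAX"
    MIN = "MIN"


def busy_time_tpr(aggregator, cumulative_input_rate, gauge_mean_input_rate,
                  busy_time_avg_ms):
    """Pick the numerator estimator that matches the denominator's estimator.

    busy_time_avg_ms is busy time in ms per second and is assumed to be > 0;
    the zero case is deliberately not handled in this sketch.
    """
    if aggregator is BusyTimeAggregator.AVG:
        # Denominator is a cumulative rate, so keep the cumulative numerator.
        input_rate = cumulative_input_rate
    else:
        # MAX or MIN: denominator is a gauge sample mean, so use the gauge mean.
        input_rate = gauge_mean_input_rate
    return input_rate / (busy_time_avg_ms / 1000.0)


print(busy_time_tpr(BusyTimeAggregator.AVG, 500.0, 400.0, 500.0))
print(busy_time_tpr(BusyTimeAggregator.MAX, 500.0, 400.0, 500.0))
```

When the two candidate rates are equal, as under uniform sampling, the result is the same for every aggregator, which matches the claim that steady-state behavior does not change.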
h1. Notes
Behavior is unchanged under {{AVG}} and unchanged whenever metric sampling is
uniform (the common steady state). The new estimator differs only under
non-uniform sampling. The change is covered by unit tests.
> Non-source vertices do not use per-second rate metrics, producing inaccurate
> scaling decisions
> ----------------------------------------------------------------------------------------------
>
> Key: FLINK-39306
> URL: https://issues.apache.org/jira/browse/FLINK-39306
> Project: Flink
> Issue Type: Bug
> Components: Autoscaler, Kubernetes Operator
> Affects Versions: kubernetes-operator-1.14.0
> Reporter: Dennis-Mircea Ciupitu
> Priority: Major
> Labels: autoscaling, operator, pull-request-available
> Fix For: kubernetes-operator-1.16.0
--
This message was sent by Atlassian Jira
(v8.20.10#820010)