[ 
https://issues.apache.org/jira/browse/FLINK-39306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis-Mircea Ciupitu updated FLINK-39306:
------------------------------------------
    Summary: Align the busy-time TRUE_PROCESSING_RATE numerator estimator with 
the busy-time aggregator  (was: Non-source vertices do not use per-second rate 
metrics, producing inaccurate scaling decisions)

> Align the busy-time TRUE_PROCESSING_RATE numerator estimator with the 
> busy-time aggregator
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39306
>                 URL: https://issues.apache.org/jira/browse/FLINK-39306
>             Project: Flink
>          Issue Type: Bug
>          Components: Autoscaler, Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.14.0
>            Reporter: Dennis-Mircea Ciupitu
>            Priority: Major
>              Labels: autoscaling, operator, pull-request-available
>             Fix For: kubernetes-operator-1.16.0
>
>
> h1. Summary
> The busy-time {{TRUE_PROCESSING_RATE}} is computed as a ratio, 
> {{busyTimeTpr = inputRate / busyTime}}, but the numerator and denominator 
> are estimated over the metric window with different methods. The denominator 
> is computed consistently with the configured busy-time aggregator, while the 
> numerator always uses {{getRate}} on the cumulative records counter. Under 
> the default {{MAX}} aggregator the two halves of the ratio therefore use 
> different temporal estimators, which is internally inconsistent and can skew 
> the ratio under non-uniform metric sampling. This issue aligns the numerator 
> estimator with the denominator so the ratio is internally consistent.
> h1. Background
> {{TRUE_PROCESSING_RATE}} (the vertex capacity used by the scaler) is the 
> larger of two sub-paths, a busy-time based estimate and an observed estimate, 
> selected by {{selectTprMetric}}.
> The busy-time estimate is 
> {{busyTimeTpr = inputRate / (busyTimeAvg / 1000)}}. The denominator 
> {{busyTimeAvg}} depends on the 
> {{kubernetes.operator.metrics.busy-time.aggregator}} option:
>  * {{AVG}}: {{getRate(ACCUMULATED_BUSY_TIME) / parallelism}}, a 
> cumulative (time-integral) rate.
>  * {{MAX}} or {{MIN}} (default {{MAX}}): {{getAverage(LOAD) * 1000}}, 
> an arithmetic mean of the per-second busy-time gauge samples.
>
> The numerator {{inputRate}} is always {{getRate(NUM_RECORDS_IN)}}, a 
> cumulative endpoint rate.
> {{getRate}} (a time-weighted, time-integral average) and {{getAverage}} of a 
> per-second gauge (an unweighted sample mean) are two different linear 
> estimators. They agree under uniform sampling but diverge under non-uniform 
> sampling (bursts, recovery, transients).
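
The difference between the two estimators can be illustrated with a minimal sketch. It is not the autoscaler's code: `endpoint_rate` and `interval_gauges` are hypothetical helpers, and the sketch assumes each per-second gauge sample reports the rate observed over the interval since the previous sample.

```python
from statistics import fmean


def endpoint_rate(times, counter):
    """Rate from a cumulative counter: (last - first) / elapsed seconds."""
    return (counter[-1] - counter[0]) / (times[-1] - times[0])


def interval_gauges(times, counter):
    """Per-second gauge value for each sampling interval (assumed model)."""
    return [
        (c1 - c0) / (t1 - t0)
        for t0, t1, c0, c1 in zip(times, times[1:], counter, counter[1:])
    ]


# Uniform sampling: every interval is 10 s long.
uniform_t, uniform_c = [0, 10, 20, 30], [0, 1000, 3000, 4000]
# Non-uniform sampling: intervals of 10 s, 2 s and 18 s, same per-interval rates.
skewed_t, skewed_c = [0, 10, 12, 30], [0, 1000, 1400, 3200]

for label, t, c in (("uniform", uniform_t, uniform_c), ("skewed", skewed_t, skewed_c)):
    # First value is the time-weighted estimate, second the unweighted sample mean.
    print(label, endpoint_rate(t, c), fmean(interval_gauges(t, c)))
```

With equal interval lengths the two values coincide. In the skewed case the short 2 s burst counts as a full sample in the unweighted mean but contributes only 2 of 30 seconds to the endpoint rate, so the two estimates separate.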
> h1. Problem
> Under the default {{MAX}} aggregator, {{busyTimeTpr}}'s denominator is a 
> per-second sample mean while its numerator is a cumulative endpoint rate. 
> Because the value is a ratio of two co-varying quantities, using a single 
> shared estimator for both lets their common sampling weighting cancel, 
> whereas mixing estimators leaves the denominator's sampling artifact in the 
> result. The numerator is also the odd one out relative to the observed 
> estimate it is compared against in {{selectTprMetric}}, which is built 
> entirely from per-second gauges.
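
The cancellation argument can be checked numerically. The following is a sketch under simplifying assumptions, not the operator's implementation: a single subtask, a hypothetical fixed `CAPACITY` in records per busy second, and an input rate that is exactly proportional to the busy fraction in every sampling interval.

```python
from statistics import fmean

CAPACITY = 1000.0                # hypothetical records per busy second
durations = [10.0, 2.0, 18.0]    # non-uniform sampling intervals, in seconds
load = [0.1, 0.9, 0.1]           # busy fraction observed in each interval
input_rate = [CAPACITY * b for b in load]  # records per second in each interval


def time_weighted(values, weights):
    """Time-integral average: each value weighted by its interval length."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Mixed estimators: time-weighted numerator over an unweighted-mean denominator.
mixed = time_weighted(input_rate, durations) / fmean(load)

# Aligned estimators: the same weighting on both halves of the ratio.
aligned_mean = fmean(input_rate) / fmean(load)
aligned_rate = time_weighted(input_rate, durations) / time_weighted(load, durations)

print(mixed, aligned_mean, aligned_rate)
```

Both aligned variants recover `CAPACITY` because the shared weighting appears in the numerator and the denominator and divides out. The mixed variant instead scales the result by the ratio of the time-weighted load to the unweighted mean load, which here is well below 1.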
> h1. Goal
> Make the busy-time {{TRUE_PROCESSING_RATE}} numerator follow the same 
> estimator as its busy-time denominator: per-second gauge mean under {{MAX}} 
> or {{MIN}}, cumulative {{getRate}} under {{AVG}}. This is an 
> internal-consistency fix for the ratio. It is scoped to that numerator only. 
> It is not a change to the demand or edge data-rate paths, and it does not 
> attempt to revalidate the underlying capacity model, the observed-rate 
> formula, or the subtask aggregation, which are out of scope.
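
The selection rule described in the goal could be sketched as follows. All names here (`BusyTimeAggregator`, `input_rate_for_busy_time_tpr`, and its parameters) are hypothetical and chosen for illustration; the actual change lives in the Java autoscaler and may be structured differently.

```python
from enum import Enum
from statistics import fmean


class BusyTimeAggregator(Enum):
    AVG = "AVG"
    MAX = "MAX"
    MIN = "MIN"


def input_rate_for_busy_time_tpr(aggregator, times, records_in_counter, records_in_per_sec):
    """Pick the numerator estimator that matches the busy-time denominator."""
    if aggregator is BusyTimeAggregator.AVG:
        # Cumulative endpoint rate, matching getRate on the busy-time counter.
        elapsed = times[-1] - times[0]
        return (records_in_counter[-1] - records_in_counter[0]) / elapsed
    # MAX or MIN: unweighted mean of the per-second gauge samples.
    return fmean(records_in_per_sec)


times = [0, 10, 12, 30]
counter = [0, 1000, 1400, 3200]
gauges = [100.0, 200.0, 100.0]

avg_rate = input_rate_for_busy_time_tpr(BusyTimeAggregator.AVG, times, counter, gauges)
max_rate = input_rate_for_busy_time_tpr(BusyTimeAggregator.MAX, times, counter, gauges)
print(avg_rate, max_rate)
```

Only the numerator of the busy-time estimate is switched in this sketch, in line with the stated scope.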
> h1. Notes
> Behavior is unchanged under {{AVG}} and unchanged whenever metric sampling is 
> uniform (the common steady state). The new estimator only differs under 
> non-uniform sampling. Covered by unit tests.



