[ 
https://issues.apache.org/jira/browse/FLINK-39549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gyula Fora updated FLINK-39549:
-------------------------------
    Fix Version/s:     (was: kubernetes-operator-1.15.0)

> Compute OBSERVED_TPR for non-source vertices behind a feature flag for 
> Operator Autoscaler
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39549
>                 URL: https://issues.apache.org/jira/browse/FLINK-39549
>             Project: Flink
>          Issue Type: Improvement
>          Components: Autoscaler, Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.14.0
>            Reporter: Dennis-Mircea Ciupitu
>            Priority: Major
>              Labels: pull-request-available
>
> h1. Summary
> The autoscaler currently computes the backpressure-derived observed true 
> processing rate ({{OBSERVED_TPR}}) only for source vertices. For every 
> downstream (non-source) vertex, the evaluator falls back exclusively to the 
> busy-time-based true processing rate. This creates an asymmetry in how 
> scaling decisions are made across the job graph.
> h1. Problem
> Sources may be evaluated using a backpressure-aware processing rate while 
> their downstream vertices are evaluated purely from busy-time. When the 
> actual bottleneck of the pipeline is downstream of the sources, this 
> asymmetry can lead to inconsistent or suboptimal scaling decisions:
> - Sources observed as backpressured may scale based on the observed 
> (capacity-aware) rate.
> - The true bottleneck vertex further down the graph is scaled based on 
> busy-time alone, which is known to be unreliable under sustained 
> backpressure.
> h1. Goal
> Remove the source-vs-non-source asymmetry by extending the observed true 
> processing rate computation to non-source vertices, so that every vertex in 
> the graph is, when appropriate conditions are met, evaluated using the same 
> capacity model. This should be a per-vertex, opt-in capability that preserves 
> existing behavior by default.
> h1. Scope
> - Generalize observed-TPR computation so it can apply to any vertex, not just 
> sources.
> - Define a non-source trigger that is the semantic dual of the source-side 
> gate, so that the new behavior aligns with, rather than diverges from, the 
> existing source path.
> - Keep the change strictly opt-in via a new feature flag, defaulting to off, 
> with a tunable threshold for safe rollout.
> - Preserve the existing source path and its semantics unchanged.
> h1. Compatibility & Risks
> - Default behavior is unchanged; existing deployments are unaffected unless 
> the new flag is explicitly enabled.
> - When enabled, the new path is gated conservatively so it activates only 
> when the per-vertex measurement is meaningful, mitigating the risk of 
> premature or noisy switches in scaling decisions.
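
The opt-in gating described under Scope and Compatibility can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the class, the method name, the parameters, and the idea of a normalized "gate signal" compared against the threshold are hypothetical and are not taken from the operator's code or configuration. The issue does not specify what the non-source trigger measures, so the sketch treats it as an opaque value.

```java
/**
 * Hypothetical sketch of per-vertex selection between the busy-time based
 * rate and the observed rate for a non-source vertex. None of these names
 * exist in the Flink Kubernetes Operator; they are illustrative only.
 */
public final class NonSourceTprSketch {

    private NonSourceTprSketch() {}

    /**
     * @param busyTimeTpr  busy-time based true processing rate (existing behavior)
     * @param observedTpr  observed true processing rate, NaN if unavailable
     * @param gateSignal   assumed per-vertex signal that the gate checks
     * @param flagEnabled  assumed feature flag, off by default
     * @param threshold    assumed tunable threshold for the gate
     * @return the rate the evaluator would use in this sketch
     */
    public static double selectTpr(
            double busyTimeTpr,
            double observedTpr,
            double gateSignal,
            boolean flagEnabled,
            double threshold) {
        // Flag off: behavior is unchanged, busy-time rate only.
        if (!flagEnabled) {
            return busyTimeTpr;
        }
        // Measurement not meaningful: stay on the busy-time rate.
        if (Double.isNaN(observedTpr) || observedTpr <= 0.0) {
            return busyTimeTpr;
        }
        // Gate not met: stay on the busy-time rate to avoid noisy switches.
        if (gateSignal < threshold) {
            return busyTimeTpr;
        }
        return observedTpr;
    }
}
```

In this sketch, the busy-time rate is the fallback in every branch except the last, which mirrors the stated goal that existing deployments are unaffected unless the flag is explicitly enabled and the per-vertex measurement is meaningful.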



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
