[ https://issues.apache.org/jira/browse/FLINK-39549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gyula Fora updated FLINK-39549:
-------------------------------
Fix Version/s: (was: kubernetes-operator-1.15.0)
> Compute OBSERVED_TPR for non-source vertices behind a feature flag for
> Operator Autoscaler
> ------------------------------------------------------------------------------------------
>
> Key: FLINK-39549
> URL: https://issues.apache.org/jira/browse/FLINK-39549
> Project: Flink
> Issue Type: Improvement
> Components: Autoscaler, Kubernetes Operator
> Affects Versions: kubernetes-operator-1.14.0
> Reporter: Dennis-Mircea Ciupitu
> Priority: Major
> Labels: pull-request-available
>
> h1. Summary
> The autoscaler currently computes the backpressure-derived observed true
> processing rate ({{OBSERVED_TPR}}) only for source vertices. For every
> downstream (non-source) vertex, the evaluator falls back exclusively to the
> busy-time-based true processing rate. This creates an asymmetry in
> how scaling decisions are made across the job graph.
> h1. Problem
> Sources may be evaluated using a backpressure-aware processing rate while
> their downstream vertices are evaluated purely from busy-time. When the
> actual bottleneck of the pipeline is downstream of the sources, this
> asymmetry can lead to inconsistent or suboptimal scaling decisions:
> - Sources observed as backpressured may scale based on the observed
> (capacity-aware) rate.
> - The true bottleneck vertex further down the graph is scaled based on
> busy-time alone, which under sustained backpressure is itself known to be
> unreliable.
> h1. Goal
> Remove the source-vs-non-source asymmetry by extending the observed true
> processing rate computation to non-source vertices, so that every vertex in
> the graph is, when appropriate conditions are met, evaluated using the same
> capacity model. This should be a per-vertex, opt-in capability that preserves
> existing behavior by default.
> h1. Scope
> - Generalize observed-TPR computation so it can apply to any vertex, not just
> sources.
> - Define a non-source trigger that is the semantic dual of the source-side
> gate, so that the new behavior aligns with, rather than diverges from, the
> existing source path.
> - Keep the change strictly opt-in via a new feature flag, defaulting to off,
> with a tunable threshold for safe rollout.
> - Preserve the existing source path and its semantics unchanged.
> h1. Compatibility & Risks
> - Default behavior is unchanged; existing deployments are unaffected unless
> the new flag is explicitly enabled.
> - When enabled, the new path is gated conservatively so it activates only
> when the per-vertex measurement is meaningful, mitigating the risk of
> premature or noisy switches in scaling decisions.
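
To make the opt-in, per-vertex gating described under Scope more concrete, here is a minimal sketch of how the rate selection could look. It is not the operator's code: the class, method, and parameter names are all hypothetical, and the trigger used for non-source vertices (the backpressured time fraction reaching a threshold) is an assumption made only for illustration, since the issue does not specify the actual trigger. The source branch deliberately leaves out the existing source-side gate.

```java
/** Hypothetical sketch; none of these names come from the Flink code base. */
final class ObservedTprGateSketch {

    private ObservedTprGateSketch() {}

    /**
     * Picks the true processing rate (records/s) used to evaluate one vertex.
     *
     * @param isSource                 whether the vertex is a source
     * @param nonSourceFlagEnabled     assumed feature flag, off by default
     * @param backPressuredFraction    assumed trigger signal in [0, 1]
     * @param minBackPressuredFraction assumed tunable threshold in [0, 1]
     * @param busyTimeTpr              busy-time based true processing rate
     * @param observedTpr              observed true processing rate, NaN if unavailable
     */
    static double selectTpr(
            boolean isSource,
            boolean nonSourceFlagEnabled,
            double backPressuredFraction,
            double minBackPressuredFraction,
            double busyTimeTpr,
            double observedTpr) {
        // No usable observed measurement: always fall back to busy-time.
        if (Double.isNaN(observedTpr) || Double.isInfinite(observedTpr)) {
            return busyTimeTpr;
        }
        if (isSource) {
            // Source path stays as it is today (its own gate is not modelled here).
            return observedTpr;
        }
        // Non-source vertices switch only when the flag is on AND the
        // per-vertex signal is strong enough to be meaningful.
        boolean gateOpen =
                nonSourceFlagEnabled && backPressuredFraction >= minBackPressuredFraction;
        return gateOpen ? observedTpr : busyTimeTpr;
    }
}
```

With the flag off, every non-source vertex keeps using the busy-time rate, which matches the stated default of unchanged behavior. The threshold keeps a briefly backpressured vertex from flipping between the two rates.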