Dennis-Mircea Ciupitu created FLINK-39549:
---------------------------------------------
Summary: Compute OBSERVED_TPR for non-source vertices behind a
feature flag for Operator Autoscaler
Key: FLINK-39549
URL: https://issues.apache.org/jira/browse/FLINK-39549
Project: Flink
Issue Type: Improvement
Components: Autoscaler, Kubernetes Operator
Affects Versions: kubernetes-operator-1.14.0
Reporter: Dennis-Mircea Ciupitu
Fix For: kubernetes-operator-1.15.0
h1. Summary
The autoscaler currently computes the backpressure-derived observed true
processing rate ({{OBSERVED_TPR}}) only for source vertices. For every
downstream (non-source) vertex, the evaluator falls back exclusively to the
busy-time-based true processing rate. This creates an asymmetry in how scaling
decisions are made across the job graph.
h1. Problem
Sources may be evaluated using a backpressure-aware processing rate while their
downstream vertices are evaluated purely from busy-time. When the actual
bottleneck of the pipeline is downstream of the sources, this asymmetry can
lead to inconsistent or suboptimal scaling decisions:
- Sources observed as backpressured may scale based on the observed
(capacity-aware) rate.
- The true bottleneck vertex further down the graph is scaled based on
busy-time alone, which under sustained backpressure is itself known to be
unreliable.
h1. Goal
Remove the source-vs-non-source asymmetry by extending the observed true
processing rate computation to non-source vertices, so that every vertex in the
graph is, when appropriate conditions are met, evaluated using the same
capacity model. This should be a per-vertex, opt-in capability that preserves
existing behavior by default.
h1. Scope
- Generalize observed-TPR computation so it can apply to any vertex, not just
sources.
- Define a non-source trigger that is the semantic dual of the source-side
gate, so that the new behavior aligns with, rather than diverges from, the
existing source path.
- Keep the change strictly opt-in via a new feature flag, defaulting to off,
with a tunable threshold for safe rollout.
- Preserve the existing source path and its semantics unchanged.
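The opt-in gating described in the scope above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: the class {{NonSourceTprSketch}}, the method {{selectTpr}}, and the parameters {{enabled}}, {{threshold}} and {{triggerSignal}} are hypothetical names, and the issue does not specify what the non-source trigger actually measures. The source path is deliberately left out, since it is meant to stay unchanged.

```java
public final class NonSourceTprSketch {

    /**
     * Picks the true processing rate (TPR) used to evaluate one NON-SOURCE vertex.
     *
     * All parameter names are hypothetical and do not correspond to real
     * operator configuration keys or metrics:
     *   enabled       - the new feature flag (assumed to default to false)
     *   threshold     - the tunable threshold guarding the new path
     *   triggerSignal - whatever per-vertex measurement the non-source trigger uses
     *   busyTimeTpr   - the existing busy-time based rate
     *   observedTpr   - the backpressure-derived observed rate, if it was computed
     */
    public static double selectTpr(
            boolean enabled,
            double threshold,
            double triggerSignal,
            double busyTimeTpr,
            double observedTpr) {
        // Flag off: behave exactly as before and use busy-time only.
        if (!enabled) {
            return busyTimeTpr;
        }
        // Conservative gate: a missing or below-threshold signal keeps the old path.
        if (Double.isNaN(triggerSignal) || triggerSignal < threshold) {
            return busyTimeTpr;
        }
        // Only trust the observed rate when it is a usable positive number.
        if (Double.isNaN(observedTpr) || Double.isInfinite(observedTpr) || observedTpr <= 0.0) {
            return busyTimeTpr;
        }
        return observedTpr;
    }

    public static void main(String[] args) {
        // Flag disabled: the busy-time rate is kept.
        System.out.println(selectTpr(false, 0.5, 0.9, 1000.0, 800.0)); // prints 1000.0
        // Flag enabled and signal above the threshold: the observed rate is used.
        System.out.println(selectTpr(true, 0.5, 0.9, 1000.0, 800.0)); // prints 800.0
    }
}
```

Falling back to the busy-time rate in every doubtful case is one way to keep default behavior intact and to avoid switching capacity models on a noisy measurement.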
h1. Compatibility & Risks
- Default behavior is unchanged; existing deployments are unaffected unless the
new flag is explicitly enabled.
- When enabled, the new path is gated conservatively so it activates only when
the per-vertex measurement is meaningful, mitigating the risk of premature or
noisy switches in scaling decisions.