Dennis-Mircea Ciupitu created FLINK-39549:
---------------------------------------------

             Summary: Compute OBSERVED_TPR for non-source vertices behind a 
feature flag for Operator Autoscaler
                 Key: FLINK-39549
                 URL: https://issues.apache.org/jira/browse/FLINK-39549
             Project: Flink
          Issue Type: Improvement
          Components: Autoscaler, Kubernetes Operator
    Affects Versions: kubernetes-operator-1.14.0
            Reporter: Dennis-Mircea Ciupitu
             Fix For: kubernetes-operator-1.15.0


h1. Summary

The autoscaler currently computes the backpressure-derived observed true 
processing rate ({{OBSERVED_TPR}}) only for source vertices. For every 
downstream (non-source) vertex, the evaluator falls back exclusively to the 
busy-time-based true processing rate. This creates an asymmetry in how 
scaling decisions are made across the job graph.

h1. Problem
Sources may be evaluated using a backpressure-aware processing rate while their 
downstream vertices are evaluated purely from busy-time. When the actual 
bottleneck of the pipeline is downstream of the sources, this asymmetry can 
lead to inconsistent or suboptimal scaling decisions:
- Sources observed as backpressured may scale based on the observed 
(capacity-aware) rate.
- The true bottleneck vertex further down the graph is scaled based on 
busy-time alone, which under sustained backpressure is itself known to be 
unreliable.

h1. Goal
Remove the source-vs-non-source asymmetry by extending the observed true 
processing rate computation to non-source vertices, so that every vertex in the 
graph is, when the appropriate conditions are met, evaluated using the same 
capacity model. This should be a per-vertex, opt-in capability that preserves 
existing behavior by default.

h1. Scope
- Generalize observed-TPR computation so it can apply to any vertex, not just 
sources.
- Define a non-source trigger that is the semantic dual of the source-side 
gate, so that the new behavior aligns with, rather than diverges from, the 
existing source path.
- Keep the change strictly opt-in via a new feature flag, defaulting to off, 
with a tunable threshold for safe rollout.
- Preserve the existing source path and its semantics unchanged.
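
A minimal Java sketch of how the opt-in selection for a non-source vertex could look, under stated assumptions. The class name, the option names ({{nonSourceObservedTprEnabled}}, {{gateThreshold}}) and the generic {{gateSignal}} input are hypothetical; this issue does not name the flag, the threshold, or the exact trigger, and nothing here is taken from the operator code base. The source path is intentionally not modeled, since it stays unchanged.

```java
import java.util.OptionalDouble;

public class NonSourceTprSketch {

    /** Hypothetical opt-in options; the flag defaults to off to preserve existing behavior. */
    static final class Options {
        final boolean nonSourceObservedTprEnabled; // hypothetical flag name
        final double gateThreshold;                // hypothetical tunable threshold

        Options(boolean nonSourceObservedTprEnabled, double gateThreshold) {
            this.nonSourceObservedTprEnabled = nonSourceObservedTprEnabled;
            this.gateThreshold = gateThreshold;
        }

        /** Flag off; the threshold is a placeholder, not a proposed default. */
        static Options defaults() {
            return new Options(false, 1.0);
        }
    }

    /**
     * Picks the true processing rate used for a non-source vertex.
     *
     * @param busyTimeTpr busy-time-based rate (the current behavior)
     * @param observedTpr backpressure-derived rate, empty if it could not be computed
     * @param gateSignal  assumed per-vertex measurement compared against the threshold
     */
    static double tprForNonSourceVertex(
            double busyTimeTpr,
            OptionalDouble observedTpr,
            double gateSignal,
            Options options) {
        // Flag off: identical to today's behavior.
        if (!options.nonSourceObservedTprEnabled) {
            return busyTimeTpr;
        }
        // Conservative gate: switch only when a measurement exists and the signal is strong enough.
        if (observedTpr.isPresent() && gateSignal >= options.gateThreshold) {
            return observedTpr.getAsDouble();
        }
        return busyTimeTpr;
    }

    public static void main(String[] args) {
        Options enabled = new Options(true, 0.5);
        System.out.println(tprForNonSourceVertex(100.0, OptionalDouble.of(80.0), 0.9, Options.defaults()));
        System.out.println(tprForNonSourceVertex(100.0, OptionalDouble.of(80.0), 0.9, enabled));
        System.out.println(tprForNonSourceVertex(100.0, OptionalDouble.empty(), 0.9, enabled));
    }
}
```

In this sketch every fallback returns the busy-time rate, so enabling the flag can only change the outcome for vertices that pass the gate.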

h1. Compatibility & Risks
- Default behavior is unchanged; existing deployments are unaffected unless the 
new flag is explicitly enabled.
- When enabled, the new path is gated conservatively so it activates only when 
the per-vertex measurement is meaningful, mitigating the risk of premature or 
noisy switches in scaling decisions.


