[
https://issues.apache.org/jira/browse/FLINK-39299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gyula Fora closed FLINK-39299.
------------------------------
Fix Version/s: kubernetes-operator-1.15.0
(was: 1.15.5)
Resolution: Fixed
merged to main 14f04033fc18fbde91bca14b6d38bbdee4ad3722
> Inconsistent vertex parallelism alignment logic
> -----------------------------------------------
>
> Key: FLINK-39299
> URL: https://issues.apache.org/jira/browse/FLINK-39299
> Project: Flink
> Issue Type: Bug
> Components: Autoscaler, Kubernetes Operator
> Affects Versions: 1.15.4
> Reporter: Dennis-Mircea Ciupitu
> Priority: Critical
> Labels: autoscaling, operator, pull-request-available
> Fix For: kubernetes-operator-1.15.0
>
>
> h1. Overview
> The parallelism alignment logic in {{JobVertexScaler#scale}} can silently
> invert the intended scaling direction.
> The alignment uses a two-phase search:
> - Phase 1: scans upward from {{newParallelism}} to {{upperBoundForAlignment}}
> for a divisor-aligned value.
> - Phase 2: if Phase 1 finds nothing, scans downward from {{newParallelism}}
> for the nearest value where per-subtask load changes, then snaps up to the
> closest aligned value.
> Neither phase is direction-aware, which causes two bugs:
> 1. *Scale-down inverted to scale-up* - Phase 1 searches upward with no cap
> relative to {{currentParallelism}}. During a scale-down (e.g., from 20 to 18), it
> can find a divisor above {{currentParallelism}} and return it, turning the
> scale-down into a scale-up.
> 2. *Scale-up inverted to scale-down* - Phase 2 searches downward with no
> floor relative to {{currentParallelism}}. During a scale-up (e.g., from 22 to 25
> with {{parallelismLowerLimit=20}}), it can settle on a value like 20 (below
> {{currentParallelism}}), turning the scale-up into a scale-down.
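To make the two inversions concrete, here is a minimal Java sketch of an unguarded two-phase search as described above. It is an illustration only, not the code in JobVertexScaler#scale. The class name, the method signature, the parameter numKeyGroupsOrPartitions, the use of integer division as the per-subtask load, and the final clamp to parallelismLowerLimit are all assumptions. The inputs (24 and 60 key groups/partitions, an alignment upper bound of 27) are hypothetical values chosen to reproduce the 20 to 18 and 22 to 25 examples.

```java
// Hypothetical, simplified model of the unguarded alignment search.
final class UnguardedAlignmentSketch {

    // Returns an aligned parallelism without looking at currentParallelism,
    // which is exactly why the scaling direction can be inverted.
    static int align(
            int newParallelism,
            int numKeyGroupsOrPartitions,
            int upperBoundForAlignment,
            int parallelismLowerLimit) {

        // Phase 1: scan upward for a value that divides the key groups/partitions evenly.
        for (int p = newParallelism; p <= upperBoundForAlignment; p++) {
            if (numKeyGroupsOrPartitions % p == 0) {
                return p;
            }
        }

        // Phase 2: scan downward until the per-subtask load (integer division) changes,
        // then step up by one if that value is not itself a divisor.
        int baselineLoad = numKeyGroupsOrPartitions / newParallelism;
        int p = newParallelism;
        for (; p > 0; p--) {
            if (numKeyGroupsOrPartitions / p > baselineLoad) {
                if (numKeyGroupsOrPartitions % p != 0) {
                    p++;
                }
                break;
            }
        }
        return Math.max(p, parallelismLowerLimit);
    }

    public static void main(String[] args) {
        // Scale-down 20 -> 18 with 24 partitions: Phase 1 walks past 20 and returns 24.
        System.out.println(align(18, 24, 24, 1)); // prints 24

        // Scale-up 22 -> 25 with 60 key groups, upper bound 27, lower limit 20:
        // Phase 1 finds no divisor in 25..27, Phase 2 settles on 20.
        System.out.println(align(25, 60, 27, 20)); // prints 20
    }
}
```

In both calls the result lies on the wrong side of the current parallelism (24 > 20 for the scale-down, 20 < 22 for the scale-up), because the search never receives the current value.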
> h1. Proposed Solution
> Add two direction-safety guards to the existing algorithm:
> 1. *Phase 1 cap:* For scale-down, the upward search upper bound is capped at
> {{currentParallelism}}. This prevents the search from returning a value above
> {{currentParallelism}}. If {{currentParallelism}} itself is the nearest
> divisor, it is returned (blocking the scale-down).
> 2. *Phase 2 guard:* After the downward fallback, a check detects whether the
> result {{p}} would invert the direction ({{p ≤ currentParallelism}} for scale-up,
> {{p ≥ currentParallelism}} for scale-down). If it would, {{currentParallelism}} is
> returned and a {{ScalingLimited}} warning is emitted.
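The two guards can be sketched in Java as follows. This is a simplified model under stated assumptions, not the merged change: the class name, the signature, the parameter numKeyGroupsOrPartitions, the integer-division load model, and the clamp to parallelismLowerLimit are hypothetical, and the ScalingLimited warning is represented only by a comment rather than a real event.

```java
// Hypothetical, simplified model of the direction-aware alignment search.
final class GuardedAlignmentSketch {

    static int align(
            int currentParallelism,
            int newParallelism,
            int numKeyGroupsOrPartitions,
            int upperBoundForAlignment,
            int parallelismLowerLimit) {

        boolean scaleDown = newParallelism < currentParallelism;

        // Phase 1 cap: during a scale-down the upward scan stops at currentParallelism.
        int phase1Bound = scaleDown
                ? Math.min(upperBoundForAlignment, currentParallelism)
                : upperBoundForAlignment;
        for (int p = newParallelism; p <= phase1Bound; p++) {
            if (numKeyGroupsOrPartitions % p == 0) {
                return p;
            }
        }

        // Phase 2: unchanged downward fallback (per-subtask load as integer division).
        int baselineLoad = numKeyGroupsOrPartitions / newParallelism;
        int p = newParallelism;
        for (; p > 0; p--) {
            if (numKeyGroupsOrPartitions / p > baselineLoad) {
                if (numKeyGroupsOrPartitions % p != 0) {
                    p++;
                }
                break;
            }
        }
        p = Math.max(p, parallelismLowerLimit);

        // Phase 2 guard: keep currentParallelism if the fallback would invert the direction.
        boolean inverted = scaleDown ? p >= currentParallelism : p <= currentParallelism;
        if (inverted) {
            // The issue says a ScalingLimited warning is emitted here; omitted in this sketch.
            return currentParallelism;
        }
        return p;
    }

    public static void main(String[] args) {
        // Scale-down 20 -> 18, 24 partitions: Phase 1 is capped at 20, Phase 2 yields 12.
        System.out.println(align(20, 18, 24, 24, 1)); // prints 12

        // Scale-down 20 -> 18, 40 partitions: 20 is the nearest divisor, scale-down is blocked.
        System.out.println(align(20, 18, 40, 40, 1)); // prints 20

        // Scale-up 22 -> 25, 60 key groups, bound 27, lower limit 20: the guard keeps 22.
        System.out.println(align(22, 25, 60, 27, 20)); // prints 22
    }
}
```

With the same hypothetical inputs that previously produced 24 and 20, the result never crosses currentParallelism: the scale-down either goes lower or is blocked, and the scale-up is held at the current value.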