[ 
https://issues.apache.org/jira/browse/FLINK-39299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gyula Fora closed FLINK-39299.
------------------------------
    Fix Version/s: kubernetes-operator-1.15.0
                       (was: 1.15.5)
       Resolution: Fixed

Merged to main: 14f04033fc18fbde91bca14b6d38bbdee4ad3722

> Inconsistent vertex parallelism alignment logic
> -----------------------------------------------
>
>                 Key: FLINK-39299
>                 URL: https://issues.apache.org/jira/browse/FLINK-39299
>             Project: Flink
>          Issue Type: Bug
>          Components: Autoscaler, Kubernetes Operator
>    Affects Versions: 1.15.4
>            Reporter: Dennis-Mircea Ciupitu
>            Priority: Critical
>              Labels: autoscaling, operator, pull-request-available
>             Fix For: kubernetes-operator-1.15.0
>
>
> h1. Overview
> The parallelism alignment logic in {{JobVertexScaler#scale}} can silently 
> invert the intended scaling direction.
> The alignment uses a two-phase search:
> - Phase 1: scans upward from {{newParallelism}} to {{upperBoundForAlignment}} 
> for a divisor-aligned value.
> - Phase 2: if Phase 1 finds nothing, scans downward from {{newParallelism}} 
> for the nearest value where per-subtask load changes, then snaps up to the 
> closest aligned value.
> Neither phase is direction-aware, which causes two bugs:
> 1. *Scale-down inverted to scale-up* - Phase 1 searches upward with no cap 
> relative to {{currentParallelism}}. During a scale-down (e.g., from 20 to 18), 
> it can find a divisor above {{currentParallelism}} and return it, turning the 
> scale-down into a scale-up.
> 2. *Scale-up inverted to scale-down* - Phase 2 searches downward with no 
> floor relative to {{currentParallelism}}. During a scale-up (e.g., from 22 to 25 
> with {{parallelismLowerLimit=20}}), it can settle on a value such as 20, which 
> is below {{currentParallelism}}, turning the scale-up into a scale-down.
> h1. Proposed Solution
> Add two direction-safety guards to the existing algorithm:
> 1. *Phase 1 cap:* For scale-down, the upper bound of the upward search is 
> capped at {{currentParallelism}}, so the search can never return a value above 
> {{currentParallelism}}. If {{currentParallelism}} itself is the nearest 
> divisor, it is returned (blocking the scale-down).
> 2. *Phase 2 guard:* After the downward fallback, a check detects whether the 
> result {{p}} would invert or cancel the intended direction ({{p ≤ 
> currentParallelism}} for scale-up, {{p ≥ currentParallelism}} for scale-down). 
> If it would, {{currentParallelism}} is returned and a {{ScalingLimited}} 
> warning is emitted.
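
For illustration only, the two-phase search and the two proposed guards can be sketched roughly as below. This is not the code from the merged commit: the class {{AlignmentSketch}}, the method {{align}}, its parameters, and the {{guarded}} switch are hypothetical names, {{n}} stands in for the number of key groups or source partitions, and the Phase 2 "snap up" is simplified to a single increment. The inputs in {{main}} are assumed values chosen to reproduce the two inversions described above.

```java
final class AlignmentSketch {

    /**
     * Simplified, hypothetical version of the alignment step.
     *
     * @param current    current parallelism of the vertex
     * @param target     parallelism requested by the scaling decision
     * @param n          number of key groups or source partitions (assumed name)
     * @param upperBound upper bound for the upward search (assumed name)
     * @param lowerLimit configured lower parallelism limit
     * @param guarded    whether the two direction-safety guards are applied
     */
    static int align(int current, int target, int n,
                     int upperBound, int lowerLimit, boolean guarded) {
        boolean scaleUp = target > current;
        boolean scaleDown = target < current;

        // Phase 1: scan upward for a value that divides n without remainder.
        // Guard 1: on scale-down, never search past the current parallelism.
        int phase1Bound = (guarded && scaleDown)
                ? Math.min(upperBound, current)
                : upperBound;
        for (int p = target; p <= phase1Bound; p++) {
            if (n % p == 0) {
                return p;
            }
        }

        // Phase 2: scan downward for the nearest value where the per-subtask
        // load (n / p) changes, then step up by one if it is not aligned.
        int p = target;
        for (; p > 0; p--) {
            if (n / p > n / target) {
                if (n % p != 0) {
                    p++;
                }
                break;
            }
        }
        p = Math.max(p, lowerLimit);

        // Guard 2: refuse a result that inverts or cancels the direction.
        // The described fix also emits a ScalingLimited warning here.
        if (guarded && ((scaleUp && p <= current) || (scaleDown && p >= current))) {
            return current;
        }
        return p;
    }

    public static void main(String[] args) {
        // Scale-down 20 -> 18 with n = 24: without the cap, Phase 1 finds 24.
        System.out.println(align(20, 18, 24, 24, 1, false)); // 24 (inverted)
        System.out.println(align(20, 18, 24, 24, 1, true));  // 12 (still a scale-down)

        // Scale-up 22 -> 25 with n = 40, upper bound 30, lower limit 20:
        // without the guard, Phase 2 settles on 20.
        System.out.println(align(22, 25, 40, 30, 20, false)); // 20 (inverted)
        System.out.println(align(22, 25, 40, 30, 20, true));  // 22 (scaling blocked)
    }
}
```

In this sketch the guards only change the outcome in the two problematic cases: an ordinary call where Phase 1 finds a divisor between the target and the bound returns the same value with or without {{guarded}}.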



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
