[ 
https://issues.apache.org/jira/browse/FLINK-39370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gyula Fora updated FLINK-39370:
-------------------------------
    Fix Version/s:     (was: kubernetes-operator-1.15.0)

> In-place scaling check only inspects K8s resource spec for adaptive 
> scheduler, ignoring JM running configuration
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39370
>                 URL: https://issues.apache.org/jira/browse/FLINK-39370
>             Project: Flink
>          Issue Type: Bug
>          Components: Autoscaler, Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.14.0
>            Reporter: Dennis-Mircea Ciupitu
>            Priority: Major
>              Labels: autoscaling, operator, pull-request-available
>
> h1. Overview
> {{NativeFlinkService.supportsInPlaceScaling()}} determines whether a running 
> Flink job supports in-place rescaling by checking whether 
> {{jobmanager.scheduler}} is set to {{Adaptive}} in the _observe configuration_, 
> which is derived from the Kubernetes {{FlinkDeployment}} 
> resource spec.
> However, the adaptive scheduler can be configured through several mechanisms 
> that are *not* reflected in the K8s resource spec:
>  * Flink's native {{flink-conf.yaml}} (baked into the Docker image or mounted 
> via ConfigMap)
>  * Environment variables
>  * Dynamic properties passed via command-line arguments
> When the adaptive scheduler is configured through any of these alternative 
> mechanisms, {{observeConfig.get(SCHEDULER)}} returns the default value 
> ({{default}}), and {{supportsInPlaceScaling()}} incorrectly returns 
> {{false}}, even though the JobManager is actually running with the adaptive 
> scheduler and fully supports in-place rescaling.
> This forces an unnecessary full restart/redeploy cycle instead of a 
> lightweight in-place scaling operation.
> h1. Expected Behavior
> The operator should detect that the running JobManager is using the adaptive 
> scheduler (by querying its REST API) and proceed with in-place scaling.
> h1. Actual Behavior
> The operator only checks the {{FlinkDeployment}} spec configuration, finds no 
> {{jobmanager.scheduler: Adaptive}} entry, and falls back to a full restart.
> h1. Proposed Fix
> The proposed fix has two parts:
> 1. Change {{supportsInPlaceScaling()}} from a static check to an instance 
> method.
> 2. When the K8s resource spec does not indicate the adaptive scheduler, have 
> that method fall back to querying the JobManager's running configuration via 
> the {{JobManagerJobConfigurationHeaders}} REST endpoint. If the JM's actual 
> running config confirms {{jobmanager.scheduler: Adaptive}}, in-place 
> scaling proceeds.
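
The two-step check described in the proposed fix could look roughly like the sketch below. It is only an illustration, not the operator's actual code: the class name {{InPlaceScalingCheck}}, the plain {{Map}} configurations, and the {{Supplier}} standing in for the REST call are all hypothetical, and returning {{false}} when the JobManager cannot be queried is an assumption rather than something the issue specifies.

```java
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch of the spec-first, REST-fallback check. */
public class InPlaceScalingCheck {
    static final String SCHEDULER_KEY = "jobmanager.scheduler";

    // Configuration derived from the FlinkDeployment resource spec.
    private final Map<String, String> observeConfig;
    // Stand-in for the REST query of the JobManager's running configuration.
    private final Supplier<Map<String, String>> runningConfigFetcher;

    public InPlaceScalingCheck(
            Map<String, String> observeConfig,
            Supplier<Map<String, String>> runningConfigFetcher) {
        this.observeConfig = observeConfig;
        this.runningConfigFetcher = runningConfigFetcher;
    }

    // A missing key is treated as the non-adaptive default scheduler.
    static boolean isAdaptive(Map<String, String> conf) {
        return "adaptive".equalsIgnoreCase(conf.getOrDefault(SCHEDULER_KEY, "default"));
    }

    public boolean supportsInPlaceScaling() {
        // Step 1: cheap check against the resource spec, as before.
        if (isAdaptive(observeConfig)) {
            return true;
        }
        // Step 2: fall back to what the JobManager is actually running with.
        try {
            return isAdaptive(runningConfigFetcher.get());
        } catch (RuntimeException e) {
            // Assumption: if the JM cannot be queried, keep the old behavior
            // and report no in-place scaling support (full restart path).
            return false;
        }
    }
}
```

With an empty spec configuration and a fetcher that reports {{jobmanager.scheduler: Adaptive}}, the sketch returns {{true}}, which is the case the current spec-only check misses. Checking the spec first avoids a REST round trip whenever the scheduler is already declared there.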



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
