[
https://issues.apache.org/jira/browse/FLINK-39370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gyula Fora updated FLINK-39370:
-------------------------------
Fix Version/s: (was: kubernetes-operator-1.15.0)
> In-place scaling check only inspects K8s resource spec for adaptive
> scheduler, ignoring JM running configuration
> ----------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-39370
> URL: https://issues.apache.org/jira/browse/FLINK-39370
> Project: Flink
> Issue Type: Bug
> Components: Autoscaler, Kubernetes Operator
> Affects Versions: kubernetes-operator-1.14.0
> Reporter: Dennis-Mircea Ciupitu
> Priority: Major
> Labels: autoscaling, operator, pull-request-available
>
> h1. Overview
> {{NativeFlinkService.supportsInPlaceScaling()}} determines whether a running
> Flink job supports in-place rescaling by checking whether
> {{jobmanager.scheduler}} is set to {{Adaptive}} in the {_}observe
> configuration{_}, which is derived from the Kubernetes {{FlinkDeployment}}
> resource spec.
> However, the adaptive scheduler can be configured through several mechanisms
> that are *not* reflected in the K8s resource spec:
> * Flink's native {{flink-conf.yaml}} (baked into the Docker image or mounted
> via ConfigMap)
> * Environment variables
> * Dynamic properties passed via command-line arguments
> When the adaptive scheduler is configured through any of these alternative
> mechanisms, {{observeConfig.get(SCHEDULER)}} returns the default value
> ({{default}}), and {{supportsInPlaceScaling()}} incorrectly returns
> {{false}}, even though the JobManager is actually running with the adaptive
> scheduler and fully supports in-place rescaling.
> This forces an unnecessary full restart/redeploy cycle instead of a
> lightweight in-place scaling operation.
> h1. Expected Behavior
> The operator should detect that the running JobManager is using the adaptive
> scheduler (by querying its REST API) and proceed with in-place scaling.
> h1. Actual Behavior
> The operator only checks the {{FlinkDeployment}} spec configuration, finds no
> {{jobmanager.scheduler: Adaptive}} entry, and falls back to a full restart.
> h1. Proposed Fix
> The fix is to change {{supportsInPlaceScaling()}} from a static check to an
> instance method that, when the K8s resource spec does not indicate the
> adaptive scheduler, falls back to querying the JobManager's running
> configuration via the {{JobManagerJobConfigurationHeaders}} REST endpoint.
> If the JM's actual running config confirms
> {{jobmanager.scheduler: Adaptive}}, in-place scaling proceeds.
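
The fallback described under Proposed Fix could look roughly like the sketch below. It is not the operator's actual code: the class name {{InPlaceScalingCheck}}, the plain {{Map}} configs, and the {{Supplier}} standing in for the REST call to the JobManager are all hypothetical, chosen so the example runs without Flink on the classpath. The case-insensitive comparison and the choice to return {{false}} when the running config cannot be fetched are likewise assumptions.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical, dependency-free sketch of the proposed two-step check.
public class InPlaceScalingCheck {
    // Config key named in the issue.
    static final String SCHEDULER_KEY = "jobmanager.scheduler";

    // True if the given config selects the adaptive scheduler.
    // Case-insensitive matching is an assumption of this sketch.
    public static boolean isAdaptive(Map<String, String> config) {
        return config != null && "adaptive".equalsIgnoreCase(config.get(SCHEDULER_KEY));
    }

    // specConfig: configuration derived from the K8s resource spec.
    // runningConfigFetcher: stand-in for querying the JM's running
    // configuration over REST; it is only invoked when the spec says nothing.
    public static boolean supportsInPlaceScaling(
            Map<String, String> specConfig,
            Supplier<Map<String, String>> runningConfigFetcher) {
        if (isAdaptive(specConfig)) {
            return true; // fast path: the spec already indicates Adaptive
        }
        try {
            return isAdaptive(runningConfigFetcher.get());
        } catch (RuntimeException e) {
            // Assumption: if the JM cannot be queried, report no in-place
            // scaling support so the caller takes the full-restart path.
            return false;
        }
    }

    public static void main(String[] args) {
        // Spec is silent, but the running JM reports the adaptive scheduler.
        boolean result = supportsInPlaceScaling(
                Map.of(), () -> Map.of(SCHEDULER_KEY, "Adaptive"));
        System.out.println(result); // true
    }
}
```

Checking the spec first keeps the common case free of a network round trip; the fetcher is only consulted when the spec is silent.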