[ 
https://issues.apache.org/jira/browse/FLINK-39958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gyula Fora closed FLINK-39958.
------------------------------
    Fix Version/s: kubernetes-operator-1.16.0
         Assignee: Dennis-Mircea Ciupitu
       Resolution: Fixed

Merged to main: d971aa3ef0dec19cd98361162a252e1ede575ab1

> Autoscaler Flink REST client timeout is silently overridden by the operator 
> client timeout
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39958
>                 URL: https://issues.apache.org/jira/browse/FLINK-39958
>             Project: Flink
>          Issue Type: Bug
>          Components: Autoscaler, Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.15.0
>            Reporter: Dennis-Mircea Ciupitu
>            Assignee: Dennis-Mircea Ciupitu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: kubernetes-operator-1.16.0
>
>
> h1. Problem
> The autoscaler option {{AutoScalerOptions.FLINK_CLIENT_TIMEOUT}} (key 
> {{job.autoscaler.flink.rest-client.timeout}}, fallback key 
> {{kubernetes.operator.flink.rest-client.timeout}}, default {{10s}}) is 
> advertised and documented, but it has no effect when the autoscaler runs 
> inside the Kubernetes operator. Any value a user explicitly sets for it is 
> silently discarded.
> h1. Root cause
> When the operator constructs the autoscaler context, it first ingests the 
> resource's effective deploy configuration, which already includes any 
> user-provided {{job.autoscaler.flink.rest-client.timeout}} from 
> {{spec.flinkConfiguration}}, and then unconditionally overwrites 
> {{AutoScalerOptions.FLINK_CLIENT_TIMEOUT}} with the operator-level 
> {{OPERATOR_FLINK_CLIENT_TIMEOUT}} 
> ({{kubernetes.operator.flink.client.timeout}}). Because this is an 
> unconditional override rather than a default, the user's explicit autoscaler 
> value is always clobbered.
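
The override described above can be illustrated with a minimal sketch. This is not the operator's actual code: a plain Map stands in for Flink's Configuration, and the class name OverrideSketch and the method buildAutoscalerConf are hypothetical names chosen for illustration. Only the configuration key is taken from the issue text.

```java
import java.util.HashMap;
import java.util.Map;

public class OverrideSketch {
    // Key taken from the issue description.
    static final String AUTOSCALER_KEY = "job.autoscaler.flink.rest-client.timeout";

    // Hypothetical stand-in for the context wiring: copy the resource's
    // deploy configuration, then always write the operator-level timeout.
    static Map<String, String> buildAutoscalerConf(
            Map<String, String> deployConf, String operatorTimeout) {
        Map<String, String> conf = new HashMap<>(deployConf);
        // Unconditional put: any user-provided value is replaced here.
        conf.put(AUTOSCALER_KEY, operatorTimeout);
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> deployConf = new HashMap<>();
        deployConf.put(AUTOSCALER_KEY, "30s"); // user's explicit setting
        Map<String, String> conf = buildAutoscalerConf(deployConf, "10s");
        System.out.println(conf.get(AUTOSCALER_KEY)); // prints "10s"
    }
}
```

In this sketch the user's 30s never reaches the autoscaler, which matches the symptom reported in the issue.
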
> h1. Impact
> * The autoscaler REST-client timeout option is effectively a no-op in 
> operator mode. A user who follows the autoscaler documentation and sets 
> {{job.autoscaler.flink.rest-client.timeout}} sees no effect.
> * The behavior is inconsistent across deployment modes: the same option works 
> in the standalone autoscaler (which has no operator config to override it) 
> but is dead inside the operator.
> * Severity is low: both options default to {{10s}}, so the override is 
> invisible unless a user explicitly tunes the autoscaler option.
> h1. Expected behavior
> The operator's client timeout should continue to act as the default for the 
> autoscaler, so that the autoscaler does not time out earlier or later than 
> the rest of the operator's Flink REST interactions. However, an explicitly 
> configured {{job.autoscaler.flink.rest-client.timeout}} must be honored 
> instead of being silently overwritten. In other words, the operator timeout 
> should be applied as a default/fallback, not as an unconditional override.
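
One way to picture the default/fallback semantics requested here is the sketch below. It is an illustration under assumptions, not the merged change: a plain Map again stands in for Flink's Configuration, and FallbackSketch and buildAutoscalerConf are hypothetical names. The operator timeout is written only when the autoscaler key is absent.

```java
import java.util.HashMap;
import java.util.Map;

public class FallbackSketch {
    // Key taken from the issue description.
    static final String AUTOSCALER_KEY = "job.autoscaler.flink.rest-client.timeout";

    // Hypothetical stand-in for the context wiring: the operator-level
    // timeout acts as a default and never replaces an explicit value.
    static Map<String, String> buildAutoscalerConf(
            Map<String, String> deployConf, String operatorTimeout) {
        Map<String, String> conf = new HashMap<>(deployConf);
        // Only fills the key if the user did not set it.
        conf.putIfAbsent(AUTOSCALER_KEY, operatorTimeout);
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> tuned = new HashMap<>();
        tuned.put(AUTOSCALER_KEY, "30s"); // user's explicit setting
        System.out.println(buildAutoscalerConf(tuned, "10s").get(AUTOSCALER_KEY));

        // No explicit setting: the operator timeout is used.
        System.out.println(buildAutoscalerConf(new HashMap<>(), "10s").get(AUTOSCALER_KEY));
    }
}
```

With an explicit 30s the sketch keeps 30s. With nothing set it falls back to the operator's 10s, so default behavior is unchanged.
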
> h1. Notes
> * Backward compatible: with both options at the default {{10s}} nothing 
> changes. Only users who explicitly tune the autoscaler option are affected, 
> and for them the value now takes effect as documented.
> * This is a behavioral fix in core reconcile / autoscaler wiring, so it 
> warrants a JIRA rather than a hotfix.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
