[
https://issues.apache.org/jira/browse/FLINK-37411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17932524#comment-17932524
]
Gyula Fora commented on FLINK-37411:
------------------------------------
I am still not convinced that this is the right direction, especially for the
operator.
Let me give you an example: what happens if the user themselves introduces a
breaking change at any point in time (independent of the autoscaler or
parallelism settings)?
Will the autoscaler roll back the parallelism only? How will it distinguish
errors coming from parallelism changes vs. config/other changes?
If it cannot determine the source of the alert, it should roll back both the
parallelism and the other config changes, but how could it possibly do that?
I think these questions show that we are trying to put a very large-scope
feature into a component that doesn't have, and should not have, enough
information to execute it.
I think what we need instead is a good way to signal broken parallelism
settings to the autoscaler. With that, the operator can do the rollback and
then signal the autoscaler. For standalone we can build a rollback mechanism;
in that case standalone is a lightweight, simpler version of the operator.
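To make the idea concrete, here is a minimal sketch of what such a signal
could look like, assuming a hypothetical callback interface (none of these
names exist today, this is illustration only):
{code:java}
import java.util.Map;

import org.apache.flink.runtime.jobgraph.JobVertexID;

/**
 * Hypothetical callback the operator could use to tell the autoscaler that the
 * parallelism overrides it applied were rolled back. All names here are
 * illustrative, not an existing API.
 */
public interface ScalingRollbackListener<KEY> {

    /**
     * Invoked by the operator after it has rolled back a broken deployment, so
     * the autoscaler can blocklist the failed parallelism settings instead of
     * re-applying them.
     *
     * @param jobKey the job key
     * @param failedOverrides the per-vertex parallelism overrides that broke the job
     */
    void onScalingRolledBack(KEY jobKey, Map<JobVertexID, Integer> failedOverrides);
}
{code}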
> Introduce the rollback mechanism for Autoscaler
> ------------------------------------------------
>
> Key: FLINK-37411
> URL: https://issues.apache.org/jira/browse/FLINK-37411
> Project: Flink
> Issue Type: New Feature
> Reporter: Rui Fan
> Assignee: Rui Fan
> Priority: Major
> Fix For: kubernetes-operator-1.12.0
>
>
> h1. Background & Motivation
> In some cases, the job becomes unhealthy (cannot run normally) after it is
> scaled by the autoscaler.
> One option is to roll back the job when it cannot run normally after scaling.
> h1. Examples (Which scenarios need rollback mechanism?)
> h2. Example 1: The network memory is insufficient after scaling up
> Flink tasks will request more network memory after scaling up. The Flink job
> cannot be started (it fails over infinitely) if network memory is
> insufficient.
> The job may have had lag before scaling up, but it cannot run at all after
> scaling. We have 2 solutions for this case:
> * Autotuning is enabled: increase the TM network memory and restart the Flink
> cluster.
> * Autotuning is disabled (in-place rescaling): failing over (retrying)
> infinitely is useless; it's better to roll back the job to the last
> parallelisms or the initial parallelisms.
> h2. Example 2: GC pressure or heap usage is high
> Currently, autoscaling is paused if the GC pressure or the heap usage exceeds
> its configured threshold. (Check the
> job.autoscaler.memory.gc-pressure.threshold and
> job.autoscaler.memory.heap-usage.threshold options for more details.)
> This case might happen after scaling down; there are 2 solutions as well (see
> the config sketch after this list):
> * Autotuning is enabled: increase the TM heap memory. (The TM total memory
> may also need to be increased; currently Autotuning only ever decreases the
> TM total memory, never increases it.)
> * Autotuning is disabled (in-place rescaling): roll back the job to the last
> parallelisms or the initial parallelisms.
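> For reference, a minimal sketch of how these two thresholds would appear in
> the Flink configuration (the values shown are illustrative, not recommended
> settings):
> {code}
> # Pause autoscaling when the GC time ratio exceeds this fraction (illustrative)
> job.autoscaler.memory.gc-pressure.threshold: 0.3
> # Pause autoscaling when average heap usage exceeds this fraction (illustrative)
> job.autoscaler.memory.heap-usage.threshold: 0.9
> {code}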
> h1. Proposed change
> Note: Autotuning could be integrated with these examples in the future.
> This Jira introduces the JobUnrecoverableErrorChecker plugin (interface), and
> we could define 2 built-in checkers in the first version (for Example 1 and
> Example 2).
> {code:java}
> /**
>  * Check whether the job encountered an unrecoverable error.
>  *
>  * @param <KEY> The job key.
>  * @param <Context> Instance of JobAutoScalerContext.
>  */
> @Experimental
> public interface JobUnrecoverableErrorChecker<
>         KEY, Context extends JobAutoScalerContext<KEY>> {
>
>     /**
>      * @return True means the job encountered an unrecoverable error and the
>      *     scaling will be rolled back. Otherwise, the job ran normally or
>      *     encountered a recoverable error.
>      */
>     boolean check(Context context, EvaluatedMetrics evaluatedMetrics);
> }
> {code}
> The job is rolled back when any checker returns true, and scaling is paused
> until the cluster is restarted. A sketch of what a built-in checker could
> look like is shown below.
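> As a minimal sketch, a built-in checker for Example 2 could look roughly like
> the following. The metric names and accessors are assumptions for
> illustration; the real EvaluatedMetrics/ScalingMetric API may differ:
> {code:java}
> /**
>  * Illustrative built-in checker for Example 2: treats sustained GC pressure
>  * above the configured threshold as an unrecoverable error, which triggers a
>  * rollback. Not an existing implementation.
>  */
> public class GcPressureUnrecoverableErrorChecker<
>                 KEY, Context extends JobAutoScalerContext<KEY>>
>         implements JobUnrecoverableErrorChecker<KEY, Context> {
>
>     // Hypothetical wiring: in practice this would be read from
>     // job.autoscaler.memory.gc-pressure.threshold.
>     private final double gcPressureThreshold;
>
>     public GcPressureUnrecoverableErrorChecker(double gcPressureThreshold) {
>         this.gcPressureThreshold = gcPressureThreshold;
>     }
>
>     @Override
>     public boolean check(Context context, EvaluatedMetrics evaluatedMetrics) {
>         // Assumed accessor for global (non per-vertex) metrics.
>         EvaluatedScalingMetric gcPressure =
>                 evaluatedMetrics.getGlobalMetrics().get(ScalingMetric.GC_PRESSURE);
>         if (gcPressure == null) {
>             // No data: treat as recoverable, do not trigger a rollback.
>             return false;
>         }
>         return gcPressure.getAverage() > gcPressureThreshold;
>     }
> }
> {code}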
> h2. What needs to be discussed is:
> should the job be rolled back to the parallelism initially set by the user,
> or to the last parallelism before scaling?