[ 
https://issues.apache.org/jira/browse/FLINK-36531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899169#comment-17899169
 ] 

Maximilian Michels commented on FLINK-36531:
--------------------------------------------

Thanks for opening the issue. We currently rely on savepoint upgrade mode for 
autoscaling to work correctly. Alternatively, we could synchronize the 
rescaling time with the checkpoint completion time, or trigger a checkpoint 
externally before rescaling instead of waiting for the next periodic checkpoint.

 

> AutoScaler needs to consider the lag from last checkpoint
> ---------------------------------------------------------
>
>                 Key: FLINK-36531
>                 URL: https://issues.apache.org/jira/browse/FLINK-36531
>             Project: Flink
>          Issue Type: Improvement
>          Components: Autoscaler
>            Reporter: Sai Sharath Dandi
>            Priority: Major
>
> Autoscaler computes the target processing capacity as 
> [below|https://sg.uberinternal.com/code.uber.internal/uber-code/[email protected]/-/blob/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/utils/AutoScalerUtils.java?L47]
> // Target = LAG/CATCH_UP + INPUT_RATE*RESTART/CATCH_UP + INPUT_RATE/TARGET_UTIL
>  
> During the scaling action, the autoscaler restarts the job from the last 
> successful checkpoint. We therefore need to include the number of records 
> processed since the last successful checkpoint as part of the lag, because 
> those records will be replayed after scaling. This is particularly important 
> for jobs with long checkpoint intervals and large state, where there can be a 
> significant difference between the real-time lag and the lag measured from the 
> checkpoint.
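
The proposal in the description can be sketched roughly as follows. This is a minimal illustration in Java, not the actual AutoScalerUtils code: the class name, the method name, and the recordsSinceLastCheckpoint parameter are all hypothetical. It assumes the number of records processed since the last successful checkpoint is available as a metric, and it simply adds that number to the observed lag before applying the formula quoted above.

```java
/**
 * Hypothetical sketch of the target-capacity formula from the issue, with the
 * records processed since the last successful checkpoint added to the lag.
 * Names and signature are illustrative assumptions, not the Flink autoscaler API.
 */
final class TargetCapacitySketch {

    private TargetCapacitySketch() {}

    /**
     * Target = LAG/CATCH_UP + INPUT_RATE*RESTART/CATCH_UP + INPUT_RATE/TARGET_UTIL,
     * where LAG = observed lag + records that will be replayed after the restart.
     */
    public static double targetCapacity(
            double lagRecords,
            double recordsSinceLastCheckpoint, // assumed to be measurable
            double inputRatePerSec,
            double restartSec,
            double catchUpSec,
            double targetUtilization) {
        if (catchUpSec <= 0 || targetUtilization <= 0) {
            throw new IllegalArgumentException(
                    "catchUpSec and targetUtilization must be positive");
        }
        // Records processed after the last checkpoint are replayed on restart,
        // so they count as additional lag. Negative inputs are clamped to zero.
        double effectiveLag = lagRecords + Math.max(0.0, recordsSinceLastCheckpoint);

        return effectiveLag / catchUpSec
                + inputRatePerSec * restartSec / catchUpSec
                + inputRatePerSec / targetUtilization;
    }
}
```

For example, with a lag of 6000 records, an input rate of 100 records/s, a 60 s restart time, a 600 s catch-up window, and a target utilization of 0.5, the formula yields 220 records/s when the replay term is zero and 240 records/s when 12000 records were processed since the last checkpoint. The gap grows with the checkpoint interval, which matches the concern raised in the description.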



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
