[ https://issues.apache.org/jira/browse/FLINK-32484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gyula Fora closed FLINK-32484.
------------------------------
    Resolution: Duplicate

> AdaptiveScheduler combined restart during scaling out
> -----------------------------------------------------
>
>                 Key: FLINK-32484
>                 URL: https://issues.apache.org/jira/browse/FLINK-32484
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / Core
>    Affects Versions: 1.17.0
>            Reporter: Prabhu Joseph
>            Priority: Major
>
> On a scaling-out operation, when nodes are added at different times,
> AdaptiveScheduler does multiple restarts within a short period of time. On
> one of our Flink jobs, we have seen AdaptiveScheduler restart the
> ExecutionGraph every time there is a notification of new resources to it.
> There are five restarts within 3 minutes.
>
> AdaptiveScheduler could provide a configurable restart window interval to the
> user during which it combines the notified resources and restarts once when
> the available resources are sufficient to fit the desired parallelism or when
> the window times out. The window is created during the first notification of
> resources received. This is applicable only when the execution graph is in
> the executing state and not in the waiting for resources state.
>
> {code:java}
> [root@ip-1-2-3-4 container_1688034805200_0002_01_000001]# grep -i scale *
> jobmanager.log:2023-06-29 10:46:58,061 INFO org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
> jobmanager.log:2023-06-29 10:47:57,317 INFO org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
> jobmanager.log:2023-06-29 10:48:53,314 INFO org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
> jobmanager.log:2023-06-29 10:49:27,821 INFO org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
> jobmanager.log:2023-06-29 10:50:15,672 INFO org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler [] - New resources are available. Restarting job to scale up.
> [root@ip-1-2-3-4 container_1688034805200_0002_01_000001]#
> {code}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
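
The restart window proposed in the issue description could look roughly like the sketch below. This is only an illustration of the idea, not Flink code: `RestartWindow`, `onNewResources`, and `RestartWindowDemo` are hypothetical names, the 5-minute window and the desired parallelism of 10 are assumed values, and the slot counts per notification are made up. The notification times are the log timestamps above, rounded to seconds relative to the first one.

```java
import java.time.Duration;

/** Hypothetical helper (not part of Flink): decides when a combined restart should fire. */
final class RestartWindow {
    private final int desiredParallelism;
    private final long windowMillis;
    private long windowStartMillis = -1; // -1 means no window is currently open

    RestartWindow(int desiredParallelism, Duration window) {
        this.desiredParallelism = desiredParallelism;
        this.windowMillis = window.toMillis();
    }

    /** Called on each "new resources" notification; returns true if the job should restart now. */
    boolean onNewResources(int availableSlots, long nowMillis) {
        if (windowStartMillis < 0) {
            windowStartMillis = nowMillis; // the first notification opens the window
        }
        boolean enough = availableSlots >= desiredParallelism;
        boolean timedOut = nowMillis - windowStartMillis >= windowMillis;
        if (enough || timedOut) {
            windowStartMillis = -1; // close the window; the next notification opens a new one
            return true;
        }
        return false; // keep collecting resources inside the window
    }
}

final class RestartWindowDemo {
    /** Counts how many restarts a sequence of notifications would trigger. */
    static int countRestarts(RestartWindow window, long[] timesMillis, int[] slots) {
        int restarts = 0;
        for (int i = 0; i < timesMillis.length; i++) {
            if (window.onNewResources(slots[i], timesMillis[i])) {
                restarts++;
            }
        }
        return restarts;
    }

    public static void main(String[] args) {
        // Approximate offsets of the five log entries above, in milliseconds.
        long[] times = {0, 59_000, 115_000, 150_000, 198_000};
        // Assumed total slots available after each notification (illustrative only).
        int[] slots = {2, 4, 6, 8, 10};
        RestartWindow window = new RestartWindow(10, Duration.ofMinutes(5));
        System.out.println(countRestarts(window, times, slots)); // prints 1
    }
}
```

Under these assumptions the five notifications collapse into a single restart, triggered when the fifth notification brings enough slots for the desired parallelism. A real scheduler would also need a timer so the restart fires when the window expires without a further notification; this sketch only checks the timeout when a notification arrives.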