[ https://issues.apache.org/jira/browse/FLINK-33092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17812778#comment-17812778 ]
Maximilian Michels commented on FLINK-33092:
--------------------------------------------

+1 for waiting on resources in the Executing state. I think we just need to change the RescalingController to delay triggering the actual rescale process: [https://github.com/apache/flink/blob/cb9e220c2291088459f0281aa8e8e8584436a9b2/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/scalingpolicy/RescalingController.java#L37]

Right now, it triggers immediately on a parallelism change. [~dmvk] can probably answer this.

> Improve the resource-stabilization-timeout mechanism when rescale a job for
> Adaptive Scheduler
> ----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33092
>                 URL: https://issues.apache.org/jira/browse/FLINK-33092
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>         Attachments: image-2023-09-15-14-43-35-104.png
>
>
> !image-2023-09-15-14-43-35-104.png|width=916,height=647!
> h1. 1. Proposal
> The above is the state transition graph for rescaling a job in the Adaptive
> Scheduler.
> In brief, when we trigger a rescale, the job waits for
> _*resource-stabilization-timeout*_ in the WaitingForResources state when it
> has sufficient resources but not the desired resources.
> If the _*resource-stabilization-timeout mechanism*_ is moved into the
> Executing state, the rescale downtime will be significantly reduced.
> h1. 2. Why is the downtime long?
> Currently, when rescaling a job:
> * Executing transitions to Restarting.
> * Restarting cancels the job first.
> * Restarting transitions to WaitingForResources once the whole job has
> reached a terminal state.
> * When the job has sufficient resources but not the desired resources,
> WaitingForResources has to wait for _*resource-stabilization-timeout*_.
> * WaitingForResources transitions to CreatingExecutionGraph after the
> resource-stabilization-timeout.
> The problem is that the job isn't running during the
> resource-stabilization-timeout phase.
> h1. 3. How to reduce the downtime?
> We can move the _*resource-stabilization-timeout mechanism*_ into the
> Executing state when a rescale is triggered. This means:
> * When the job has the desired resources, Executing can rescale directly.
> * When the job has sufficient resources but not the desired resources,
> we can rescale after _*resource-stabilization-timeout*_.
> * WaitingForResources will ignore the resource-stabilization-timeout
> after this improvement.
> The resource-stabilization-timeout then elapses before the job is cancelled,
> so the rescale downtime will be significantly reduced.
>
> Note: the resource-stabilization-timeout still applies in WaitingForResources
> when starting a job. Only the behavior when rescaling a job changes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)