[ https://issues.apache.org/jira/browse/FLINK-33092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rui Fan updated FLINK-33092:
----------------------------
    Description: 
!image-2023-09-15-14-43-35-104.png|width=916,height=647!
h1. 1. Proposal

The diagram above shows the state transitions when a job is rescaled in the Adaptive Scheduler.

In brief, when a rescale is triggered, the job waits for _*resource-stabilization-timeout*_ in the WaitingForResources state if it has sufficient resources but not the desired resources.

If the _*resource-stabilization-timeout mechanism*_ is moved into the Executing state, the rescale downtime will be significantly reduced.

h1. 2. Why is the downtime long?

Currently, when a job is rescaled:
 * Executing transitions to Restarting.
 * Restarting first cancels the job.
 * Restarting transitions to WaitingForResources once the whole job has reached a terminal state.
 * If the job has sufficient resources but not the desired resources, WaitingForResources waits for _*resource-stabilization-timeout*_.
 * WaitingForResources transitions to CreatingExecutionGraph after resource-stabilization-timeout.

The problem is that the job is not running during the resource-stabilization-timeout phase.

h1. 3. How to reduce the downtime?

We can move the _*resource-stabilization-timeout mechanism*_ into the Executing state when a rescale is triggered. This means:
 * If the job has the desired resources, Executing can rescale immediately.
 * If the job has sufficient resources but not the desired resources, the rescale happens after _*resource-stabilization-timeout*_.
 * After this improvement, WaitingForResources ignores resource-stabilization-timeout during a rescale.

Because resource-stabilization-timeout then elapses before the job is cancelled, the rescale downtime will be significantly reduced.

Note: resource-stabilization-timeout still applies in WaitingForResources when a job is started. Only the rescale path changes.

  was:
!image-2023-09-15-14-43-35-104.png|width=1103,height=779!
h1. 1. Proposal

The diagram above shows the state transitions when a job is rescaled in the Adaptive Scheduler.

In brief, when a rescale is triggered, the job waits for _*resource-stabilization-timeout*_ in the WaitingForResources state if it has sufficient resources but not the desired resources.

If the _*resource-stabilization-timeout mechanism*_ is moved into the Executing state, the rescale downtime will be significantly reduced.

h1. 2. Why is the downtime long?

Currently, when a job is rescaled:
 * Executing transitions to Restarting.
 * Restarting first cancels the job.
 * Restarting transitions to WaitingForResources once the whole job has reached a terminal state.
 * If the job has sufficient resources but not the desired resources, WaitingForResources waits for _*resource-stabilization-timeout*_.
 * WaitingForResources transitions to CreatingExecutionGraph after resource-stabilization-timeout.

The problem is that the job is not running during the resource-stabilization-timeout phase.

h1. 3. How to reduce the downtime?

We can move the _*resource-stabilization-timeout mechanism*_ into the Executing state when a rescale is triggered. This means:
 * If the job has the desired resources, Executing can rescale immediately.
 * If the job has sufficient resources but not the desired resources, the rescale happens after _*resource-stabilization-timeout*_.
 * After this improvement, WaitingForResources ignores resource-stabilization-timeout during a rescale.

Because resource-stabilization-timeout then elapses before the job is cancelled, the rescale downtime will be significantly reduced.

Note: resource-stabilization-timeout still applies in WaitingForResources when a job is started. Only the rescale path changes.

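The downtime argument in the description can be sketched as a small model. This is only an illustration under stated assumptions, not Flink code: the class, enum, and method names are hypothetical, the cancel and redeploy durations are made-up placeholders, and the wait is treated as the full resource-stabilization-timeout (the worst case). It compares the downtime of the current flow, where the wait happens after the job is cancelled, with the proposed flow, where the wait happens while the job is still in Executing.

```java
import java.time.Duration;

/**
 * Hypothetical model of rescale downtime; none of these names are Flink APIs.
 */
public final class RescaleDowntimeSketch {

    /** Resource situation at the moment a rescale is triggered. */
    public enum Resources { DESIRED, SUFFICIENT_ONLY }

    /**
     * Current flow: Restarting cancels the job, then WaitingForResources waits
     * for the stabilization timeout unless the desired resources are present.
     * The job is not running during any of these phases.
     */
    public static Duration currentDowntime(
            Resources resources, Duration cancel, Duration stabilization, Duration redeploy) {
        Duration wait = resources == Resources.DESIRED ? Duration.ZERO : stabilization;
        return cancel.plus(wait).plus(redeploy);
    }

    /**
     * Proposed flow: the stabilization wait happens in Executing while the job
     * is still running, so it contributes nothing to the downtime.
     */
    public static Duration proposedDowntime(Duration cancel, Duration redeploy) {
        return cancel.plus(redeploy);
    }

    public static void main(String[] args) {
        // Illustrative values only; these are assumptions, not measurements.
        Duration cancel = Duration.ofSeconds(2);
        Duration stabilization = Duration.ofSeconds(10);
        Duration redeploy = Duration.ofSeconds(3);

        for (Resources r : Resources.values()) {
            System.out.printf("%s: current=%ds proposed=%ds%n",
                    r,
                    currentDowntime(r, cancel, stabilization, redeploy).getSeconds(),
                    proposedDowntime(cancel, redeploy).getSeconds());
        }
    }
}
```

In this model the two flows differ only when the job has sufficient but not desired resources, which is exactly the case the issue targets: the stabilization wait drops out of the downtime, while the cancel and redeploy costs stay the same.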
> Improve the resource-stabilization-timeout mechanism when rescale a job for Adaptive Scheduler
> ----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33092
>                 URL: https://issues.apache.org/jira/browse/FLINK-33092
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>         Attachments: image-2023-09-15-14-43-35-104.png

--
This message was sent by Atlassian Jira
(v8.20.10#820010)