[jira] [Updated] (FLINK-33092) Improve the resource-stabilization-timeout mechanism when rescale a job for Adaptive Scheduler

Rui Fan (Jira) Fri, 15 Sep 2023 00:21:34 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-33092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rui Fan updated FLINK-33092:
----------------------------
    Description: 
!image-2023-09-15-14-43-35-104.png|width=1103,height=779!
h1. 1. Proposal

The diagram above shows the state transitions when the Adaptive Scheduler 
rescales a job.

In brief, when a rescale is triggered, the job waits for the 
_*resource-stabilization-timeout*_ in the WaitingForResources state if it has 
sufficient resources but not the desired resources.

If the _*resource-stabilization-timeout mechanism*_ is moved into the Executing 
state, the rescale downtime will be significantly reduced.
h1. 2. Why is the downtime long?

Currently, when a job is rescaled:
 * Executing transitions to Restarting.
 * Restarting first cancels the job.
 * Restarting transitions to WaitingForResources once the whole job has 
reached a terminal state.
 * If the job has sufficient resources but not the desired resources, 
WaitingForResources waits for the _*resource-stabilization-timeout*_.
 * WaitingForResources transitions to CreatingExecutionGraph after the 
resource-stabilization-timeout expires.

The problem is that the job is not running during the 
resource-stabilization-timeout phase.
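
The cost of the current flow can be illustrated with a small model. This is 
only a sketch: the class, enum, and method names are hypothetical and are not 
Flink's actual classes, and the model assumes downtime is just the cancellation 
time plus any wait in WaitingForResources (redeployment time is left out).

```java
import java.time.Duration;
import java.util.List;

/** Hypothetical model of the current rescale flow; not Flink's real API. */
final class CurrentRescaleFlowSketch {

    /** State names taken from the description above. */
    enum State { EXECUTING, RESTARTING, WAITING_FOR_RESOURCES, CREATING_EXECUTION_GRAPH }

    /** The sequence of states a job passes through when it is rescaled today. */
    static List<State> currentRescalePath() {
        return List.of(
                State.EXECUTING,
                State.RESTARTING,
                State.WAITING_FOR_RESOURCES,
                State.CREATING_EXECUTION_GRAPH);
    }

    /**
     * Simplified downtime estimate (an assumption of this sketch): the job is
     * down while it is cancelled and, if only sufficient resources are
     * available, for the whole stabilization timeout as well.
     */
    static Duration currentDowntime(
            Duration cancelTime, Duration stabilizationTimeout, boolean hasDesiredResources) {
        Duration wait = hasDesiredResources ? Duration.ZERO : stabilizationTimeout;
        return cancelTime.plus(wait);
    }
}
```

With a 5 second cancellation and a 10 second timeout, this model gives 15 
seconds of downtime when only sufficient resources are available, and 5 seconds 
when the desired resources are already present.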
h1. 3. How to reduce the downtime?

We can move the _*resource-stabilization-timeout mechanism*_ into the Executing 
state when a rescale is triggered. This means:
 * If the job has the desired resources, Executing can rescale immediately.
 * If the job has sufficient resources but not the desired resources, it 
rescales after the _*resource-stabilization-timeout*_.
 * After this improvement, WaitingForResources ignores the 
resource-stabilization-timeout during a rescale.

Because the resource-stabilization-timeout now elapses before the job is 
cancelled, the job keeps running while it waits, and the rescale downtime is 
significantly reduced.
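
The rules above could be sketched as a decision made while the job is still 
in Executing. Everything here is illustrative: the class, the {{decide}} 
method, the use of slot counts to represent resources, and the assumption that 
the timeout is measured from the moment the rescale is triggered are all 
hypothetical and do not come from Flink's code base.

```java
import java.time.Duration;

/** Hypothetical sketch of the proposed check in the Executing state. */
final class ExecutingRescaleSketch {

    enum Decision { RESCALE_NOW, KEEP_RUNNING }

    /**
     * @param availableSlots  slots currently available (assumed resource unit)
     * @param sufficientSlots minimum slots with which the job can run
     * @param desiredSlots    slots needed for the desired parallelism
     * @param sinceTrigger    time since the rescale was triggered (assumed timer start)
     * @param timeout         the resource-stabilization-timeout
     */
    static Decision decide(
            int availableSlots,
            int sufficientSlots,
            int desiredSlots,
            Duration sinceTrigger,
            Duration timeout) {
        // Desired resources are present: rescale without waiting.
        if (availableSlots >= desiredSlots) {
            return Decision.RESCALE_NOW;
        }
        // Only sufficient resources: rescale once the timeout has elapsed.
        if (availableSlots >= sufficientSlots && sinceTrigger.compareTo(timeout) >= 0) {
            return Decision.RESCALE_NOW;
        }
        // Otherwise the job keeps running, so waiting causes no downtime.
        return Decision.KEEP_RUNNING;
    }
}
```

In this sketch the wait happens while the job is still processing data, so 
only the cancel and redeploy steps contribute to downtime.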

 

Note: the resource-stabilization-timeout still applies in WaitingForResources 
when a job starts. Only the rescale path changes.


> Improve the resource-stabilization-timeout mechanism when rescale a job for 
> Adaptive Scheduler
> ----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33092
>                 URL: https://issues.apache.org/jira/browse/FLINK-33092
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>         Attachments: image-2023-09-15-14-43-35-104.png



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
