Gyula Fora created FLINK-34318:
----------------------------------
Summary: AdaptiveScheduler resource stabilisation should happen
before the job is cancelled
Key: FLINK-34318
URL: https://issues.apache.org/jira/browse/FLINK-34318
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Reporter: Gyula Fora
When a new resource requirement that increases the resource upper bound
(max TaskManagers) is submitted to the AdaptiveScheduler, the job is
cancelled as soon as the first TaskManager comes up.
Once the job is cancelled, if the scheduler cannot acquire all resources, it
waits for the entire stabilisation period to pass before starting with the
lower-than-requested parallelism.
The problem here is that waiting for resource stabilisation happens after the
job is cancelled, introducing unnecessary downtime for the job if the
stabilisation period is long.
We should change the logic here to wait for the stabilisation period first,
acquiring all possible resources before cancelling the job.
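
To make the downtime argument concrete, here is a minimal sketch that models
the two orderings. It is not Flink code: the class name, method names, and all
durations are hypothetical, and the restart cost is an assumed fixed value. It
only illustrates that waiting before cancelling moves the stabilisation wait
out of the job's downtime.

```java
import java.time.Duration;

/** Hypothetical downtime model for the two orderings; not Flink code. */
final class StabilisationOrderSketch {

    /** Returns the shorter of two durations. */
    static Duration min(Duration a, Duration b) {
        return a.compareTo(b) <= 0 ? a : b;
    }

    /**
     * Ordering described in the issue: cancel first, then wait for resources.
     * The wait ends when all resources arrive or the stabilisation period
     * expires, whichever comes first, and the job is down for all of it.
     */
    static Duration downtimeCancelFirst(
            Duration untilAllResources, Duration stabilisationPeriod, Duration restartCost) {
        return min(untilAllResources, stabilisationPeriod).plus(restartCost);
    }

    /**
     * Proposed ordering: wait while the job keeps running, then cancel.
     * In this model only the restart itself counts as downtime.
     */
    static Duration downtimeWaitFirst(
            Duration untilAllResources, Duration stabilisationPeriod, Duration restartCost) {
        return restartCost;
    }

    public static void main(String[] args) {
        Duration untilAll = Duration.ofSeconds(120);      // assumed: last TaskManager is slow
        Duration stabilisation = Duration.ofSeconds(60);  // assumed stabilisation period
        Duration restart = Duration.ofSeconds(5);         // assumed cancel + redeploy cost

        System.out.println("cancel first: "
                + downtimeCancelFirst(untilAll, stabilisation, restart).getSeconds() + "s");
        System.out.println("wait first:   "
                + downtimeWaitFirst(untilAll, stabilisation, restart).getSeconds() + "s");
    }
}
```

With these assumed values the cancel-first ordering yields 65 seconds of
downtime against 5 seconds for the wait-first ordering, and the gap grows with
the stabilisation period.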