[ 
https://issues.apache.org/jira/browse/FLINK-31457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17700680#comment-17700680
 ] 

Aleksandr Pilipenko commented on FLINK-31457:
---------------------------------------------

[~JunRuiLi], thank you for your response.

The issue is triggered by the loss of a TaskManager during job execution. 
Until a new instance becomes available, every attempt to restart the job 
results in a NoResourceAvailableException.

Setup: a standalone cluster running a single job, with the number of slots 
matching the job's requirements.

Example scenario leading to this issue:

When the job restarts, there is a chance that a subtask won't be able to 
cancel gracefully within task.cancellation.timeout (e.g. due to issues like 
FLINK-30304). This results in the TaskManager being shut down.

Until a new TaskManager instance becomes available, every attempt to 
schedule the job fails immediately with NoResourceAvailableException. If the 
configured restart delay is shorter than the task cancellation timeout, the 
first restart attempt is performed immediately after cancellation finishes, 
i.e. right after the TaskManager has been stopped.
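
As an illustration, a configuration along these lines would be exposed to the 
scenario above. The values are assumptions chosen for the example, not my 
exact setup, and the option names are the Flink 1.15 ones. The 10 s restart 
delay has long elapsed by the time the 180 s cancellation timeout fires, so 
the first restart attempt runs right after the TaskManager is killed. With 
only 3 attempts spaced 10 s apart, the job can exhaust its restarts before a 
replacement TaskManager registers.

```yaml
# flink-conf.yaml (illustrative values, not an exact reproduction setup)
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s

# Default value, in milliseconds. If a task is still cancelling after this
# timeout, the TaskManager process is terminated with a fatal error.
task.cancellation.timeout: 180000
```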

 

The standalone resource manager supports waiting for required resources, but 
only during startup. 
[[1]|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#resourcemanager-standalone-start-up-time]
 
[[2]|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/StandaloneResourceManager.java#L110-L121]
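
For reference, a sketch of how that startup wait is configured. The value is 
illustrative; the option is given in milliseconds, and when it is unset 
slot.request.timeout is used instead. It only covers the period right after 
the cluster starts, so it does not help when a TaskManager is lost later.

```yaml
# flink-conf.yaml (illustrative value)
# Start-up period of a standalone cluster, in milliseconds. During this
# period, slot requests that cannot be satisfied yet are not failed.
resourcemanager.standalone.start-up-time: 60000
```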

> Support waiting for required resources in DefaultScheduler during job restart
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-31457
>                 URL: https://issues.apache.org/jira/browse/FLINK-31457
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.3
>            Reporter: Aleksandr Pilipenko
>            Priority: Major
>
> Currently Flink supports [waiting for required resources to become 
> available|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-stabilization-timeout]
>  during job restart only when using the adaptive scheduler.
> On the other hand, if the cluster is using the default scheduler and there 
> are not enough slots available, restart attempts will fail with 
> `NoResourceAvailableException` until resource requirements are satisfied.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
