[ https://issues.apache.org/jira/browse/FLINK-29308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611332#comment-17611332 ]
Zhu Zhu edited comment on FLINK-29308 at 9/30/22 3:34 AM:
----------------------------------------------------------

That may be the cause. When fine-grained resources are used, a NoResourceAvailableException can occur if Flink cannot find a {{matching}} slot for the scheduled vertices (in the coarse-grained case, a slot can always match any slot request).

was (Author: zhuzh):
That may be the cause. If using fine grained resource, NoResourceAvailableException could happen if Flink cannot find a {{matching}} slot for scheduled vertices (in coarse grained case, a slot can always match any slot request).

> NoResourceAvailableException fails the batch job
> ------------------------------------------------
>
>                 Key: FLINK-29308
>                 URL: https://issues.apache.org/jira/browse/FLINK-29308
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: Aitozi
>            Priority: Major
>
> When running a batch job configured with the following restart strategy
> {code:java}
> restart-strategy: fixed-delay
> restart-strategy.fixed-delay.delay: 15 s
> restart-strategy.fixed-delay.attempts: 10
> {code}
> if the cluster does not have enough resources to run a whole stage, it can
> run part of the stage, but the job still fails after 10 occurrences of
> {{NoResourceAvailableException}}. IMO, for a batch job a
> {{NoResourceAvailableException}} does not necessarily need to fail the job.
> Or at least this failure reason should not share the same restart strategy
> with other failure reasons.
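
For illustration only, the slot-matching distinction described in the comment can be sketched as a simplified model. This is not Flink's implementation: {{ResourceSpec}}, {{SlotRequest}}, {{slot_matches}} and {{find_matching_slot}} are hypothetical names, and the sketch assumes that a request with no stated requirement stands for the coarse-grained case.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ResourceSpec:
    """Hypothetical resource description (not Flink's own class)."""
    cpu_cores: float
    memory_mb: int


@dataclass(frozen=True)
class SlotRequest:
    # Assumption: None models a coarse-grained request with no specific requirement.
    required: Optional[ResourceSpec] = None


def slot_matches(slot: ResourceSpec, request: SlotRequest) -> bool:
    """Return True if the slot can serve the request."""
    if request.required is None:
        # Coarse-grained case: any slot matches any request.
        return True
    # Fine-grained case: the slot must cover every requested dimension.
    return (slot.cpu_cores >= request.required.cpu_cores
            and slot.memory_mb >= request.required.memory_mb)


def find_matching_slot(free_slots: List[ResourceSpec],
                       request: SlotRequest) -> Optional[ResourceSpec]:
    """Return the first free slot that matches, or None if there is none."""
    for slot in free_slots:
        if slot_matches(slot, request):
            return slot
    return None
```

In this model, a free slot can exist while a fine-grained request still finds no match. That would correspond to the situation in the comment, where the scheduler reports that no resource is available even though the cluster is not empty.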