Aitozi created FLINK-24063:
------------------------------
Summary: Reconsider the behavior of ClusterEntrypoint#startCluster
failure handler
Key: FLINK-24063
URL: https://issues.apache.org/jira/browse/FLINK-24063
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Reporter: Aitozi
If runCluster fails inside ClusterEntrypoint#startCluster, the failure handler
triggers the STOP_APPLICATION behavior.
But consider the following case:
# The JobManager encounters a fatal error, such as a network problem, which
brings the JobManager process down.
# The resource framework (e.g. YARN or Kubernetes) then starts a new process,
but it fails in ClusterEntrypoint#startCluster because of the same network
problem.
# The job then turns into the FAILED status.
This means that a streaming job stops running for good after a single fatal
error, which is rather fragile. I think we should add a retry mechanism so that
the job does not fail fast a second time, and can ride out external errors that
may persist for a period of time.
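As a rough illustration of such a retry mechanism, here is a minimal sketch in Java. StartupRetry, runWithRetry, and the parameters maxAttempts and initialBackoffMillis are hypothetical names chosen for this example; they are not existing Flink APIs, and the 60-second backoff cap is an arbitrary assumption. The idea is only to retry a failing startup action a bounded number of times with exponential backoff before giving up.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper for illustration only; not part of Flink. */
final class StartupRetry {

    private StartupRetry() {}

    /**
     * Runs the action, retrying on failure up to maxAttempts times in total.
     * The wait between attempts starts at initialBackoffMillis and doubles
     * after each failure, capped at 60 seconds (an assumed limit).
     */
    static <T> T runWithRetry(Callable<T> action, int maxAttempts, long initialBackoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long backoff = initialBackoffMillis;
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry when interrupted; restore the flag and propagate.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff);
                    backoff = Math.min(backoff * 2, 60_000L);
                }
            }
        }
        // All attempts failed; surface the last error to the caller.
        throw lastFailure;
    }
}
```

In this sketch, only after the last attempt fails would the caller fall back to the existing STOP_APPLICATION path. Whether the retry belongs inside the entrypoint or should be left to the resource framework is an open design question.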