[ https://issues.apache.org/jira/browse/FLINK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721001#comment-16721001 ]
TisonKun edited comment on FLINK-6625 at 12/14/18 7:04 AM:
-----------------------------------------------------------

Does it mean that when a job finishes with FAILED, the clean-up would be done by the user?

Recently I have been participating in rethinking the ZooKeeper-based HaService, and I find it would be helpful if we wrote down how we want Flink to deal with termination in HA mode:

- FINISHED: global termination, clean up HA data
- FAILED: global termination, retain HA data (checkpoints and job graph)
- CANCELED: global termination, clean up HA data
- SUSPENDED: local termination; if the dispatcher fails and recovers, the job would be restarted; retain HA data

And since we retain HA data even after the dispatcher exits, it becomes the user's responsibility to do the clean-up.


was (Author: tison):

Does it mean that when a job finishes with FAILED, the clean-up would be done by the user?

Recently I have been participating in rethinking the ZooKeeper-based HaService, and I find it would be good if we wrote down how we want Flink to deal with non-success termination in HA mode:

- FAILED: global termination, retain HA data (checkpoints and job graph)
- CANCELED: global termination, clean up HA data
- SUSPENDED: local termination; if the dispatcher fails and recovers, the job would be restarted; retain HA data

And since we retain HA data even after the dispatcher exits, it becomes the user's responsibility to do the clean-up.


> Flink removes HA job data when reaching JobStatus.FAILED
> --------------------------------------------------------
>
>                 Key: FLINK-6625
>                 URL: https://issues.apache.org/jira/browse/FLINK-6625
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination
>    Affects Versions: 1.3.0, 1.4.0
>            Reporter: Till Rohrmann
>            Priority: Major
>
> Currently, Flink removes all job related data (submitted {{JobGraph}} as well
> as checkpoints) when it reaches a globally terminal state (including
> {{JobStatus.FAILED}}). In high availability mode, this entails that all data
> is removed from ZooKeeper and there is no way to recover the job by
> restarting the cluster with the same cluster id.
> I think this is problematic, since an application might just have failed
> because it has depleted its number of restart attempts. Also, the last
> checkpoint information could be helpful when trying to find out why the job
> has actually failed. I propose that we only remove job data when reaching the
> state {{JobStatus.SUCCESS}} or {{JobStatus.CANCELED}}.
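For illustration only, here is a minimal Java sketch of the clean-up policy discussed above: drop HA data (job graph, checkpoints) only on FINISHED/CANCELED, and retain it on FAILED and on local termination (SUSPENDED). The class, enum, and method names are hypothetical stand-ins for this discussion, not Flink's actual HighAvailabilityServices code.

{code:java}
// Hypothetical sketch only -- not Flink's real HA services API.
public class HaCleanupSketch {

    // Simplified stand-in for org.apache.flink.api.common.JobStatus.
    enum JobStatus { FINISHED, FAILED, CANCELED, SUSPENDED }

    /** Returns true if HA data (e.g. ZooKeeper paths) should be removed for this terminal status. */
    static boolean shouldCleanUpHaData(JobStatus terminalStatus) {
        switch (terminalStatus) {
            case FINISHED:
            case CANCELED:
                return true;   // globally terminal, the job is gone for good
            case FAILED:       // keep job graph + checkpoints for recovery and debugging
            case SUSPENDED:    // locally terminal, another dispatcher may still take over
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        for (JobStatus status : JobStatus.values()) {
            System.out.println(status + " -> clean up HA data: " + shouldCleanUpHaData(status));
        }
    }
}
{code}

Under this sketch, the clean-up that previously happened automatically on FAILED would instead be left to the user, matching the behaviour described in the comment above.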