Till Rohrmann created FLINK-6625:
------------------------------------

             Summary: Flink removes HA job data when reaching JobStatus.FAILED
                 Key: FLINK-6625
                 URL: https://issues.apache.org/jira/browse/FLINK-6625
             Project: Flink
          Issue Type: Improvement
          Components: Distributed Coordination
    Affects Versions: 1.3.0, 1.4.0
            Reporter: Till Rohrmann

Currently, Flink removes all job-related data (the submitted {{JobGraph}} as well as the checkpoints) when a job reaches a globally terminal state, including {{JobStatus.FAILED}}. In high availability mode, this means that all of the job's data is removed from ZooKeeper, so there is no way to recover the job by restarting the cluster with the same cluster id.

I think this is problematic, since an application might have failed only because it exhausted its number of restart attempts. The last checkpoint information could also be helpful when trying to find out why the job actually failed.

I propose that we only remove job data when reaching the state {{JobStatus.FINISHED}} or {{JobStatus.CANCELED}}.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
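
To make the proposal concrete, here is a minimal sketch that contrasts the current cleanup rule with the proposed one. It is an illustration only, not Flink's actual code: the class {{HaCleanupPolicy}}, its two methods, and the trimmed-down {{JobStatus}} enum are hypothetical stand-ins, and the successful terminal state is assumed to be named {{FINISHED}}.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch of the cleanup decision; not taken from the Flink code base. */
final class HaCleanupPolicy {

    /** Simplified stand-in for Flink's JobStatus, limited to the states discussed here. */
    enum JobStatus {
        RUNNING(false),
        RESTARTING(false),
        FAILED(true),
        CANCELED(true),
        FINISHED(true);

        private final boolean globallyTerminal;

        JobStatus(boolean globallyTerminal) {
            this.globallyTerminal = globallyTerminal;
        }

        boolean isGloballyTerminalState() {
            return globallyTerminal;
        }
    }

    /** Proposed rule: only these states trigger removal of HA job data. */
    private static final Set<JobStatus> REMOVE_ON =
            EnumSet.of(JobStatus.FINISHED, JobStatus.CANCELED);

    /** Behavior described in the issue: every globally terminal state removes the data. */
    static boolean currentBehaviorRemoves(JobStatus status) {
        return status.isGloballyTerminalState();
    }

    /** Proposed behavior: FAILED keeps the JobGraph and checkpoints for later recovery. */
    static boolean proposedBehaviorRemoves(JobStatus status) {
        return REMOVE_ON.contains(status);
    }

    public static void main(String[] args) {
        // Print both decisions side by side for every state in the sketch.
        for (JobStatus status : JobStatus.values()) {
            System.out.printf("%-10s current=%-5b proposed=%b%n",
                    status,
                    currentBehaviorRemoves(status),
                    proposedBehaviorRemoves(status));
        }
    }
}
```

The only state for which the two rules differ is {{FAILED}}: under the proposed rule its data stays in the HA store, so the job could be recovered after a cluster restart or inspected to find the cause of the failure.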