Till Rohrmann created FLINK-6625:
------------------------------------

             Summary: Flink removes HA job data when reaching JobStatus.FAILED
                 Key: FLINK-6625
                 URL: https://issues.apache.org/jira/browse/FLINK-6625
             Project: Flink
          Issue Type: Improvement
          Components: Distributed Coordination
    Affects Versions: 1.3.0, 1.4.0
            Reporter: Till Rohrmann


Currently, Flink removes all job-related data (the submitted {{JobGraph}} as
well as checkpoints) when a job reaches a globally terminal state (including
{{JobStatus.FAILED}}). In high availability mode, this means that all data is
removed from ZooKeeper and there is no way to recover the job by restarting the
cluster with the same cluster id.

I think this is problematic, since an application might have failed only
because it exhausted its number of restart attempts. Also, the last checkpoint
information could be helpful when trying to find out why the job actually
failed. I propose that we only remove job data when reaching the state
{{JobStatus.FINISHED}} or {{JobStatus.CANCELED}}.
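
To illustrate the proposed rule, here is a minimal, self-contained sketch. It is not Flink's actual cleanup code: the class {{HaCleanupPolicySketch}}, the enum {{TerminalStatus}} and the method {{shouldRemoveHaData}} are hypothetical names made up for this example. The enum only mirrors the globally terminal states, with the successful one named FINISHED as in Flink's {{JobStatus}}.

```java
import java.util.EnumSet;
import java.util.Set;

public class HaCleanupPolicySketch {

    // Hypothetical stand-in for the globally terminal job states;
    // this is NOT Flink's own JobStatus class.
    enum TerminalStatus { FINISHED, CANCELED, FAILED }

    // Proposed policy: only these terminal states trigger removal of HA data
    // (submitted JobGraph and checkpoints). FAILED is deliberately absent so
    // that the job stays recoverable after a cluster restart.
    private static final Set<TerminalStatus> REMOVE_HA_DATA =
            EnumSet.of(TerminalStatus.FINISHED, TerminalStatus.CANCELED);

    static boolean shouldRemoveHaData(TerminalStatus status) {
        return REMOVE_HA_DATA.contains(status);
    }

    public static void main(String[] args) {
        for (TerminalStatus status : TerminalStatus.values()) {
            System.out.println(status + " -> remove HA data: " + shouldRemoveHaData(status));
        }
    }
}
```

With this rule, a job that ends in FAILED keeps its data in the HA store, so it can still be recovered or inspected, while FINISHED and CANCELED jobs are cleaned up as before.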



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
