Till Rohrmann created FLINK-11813:
-------------------------------------

             Summary: Standby per job mode Dispatchers don't know job's JobSchedulingStatus
                 Key: FLINK-11813
                 URL: https://issues.apache.org/jira/browse/FLINK-11813
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.7.2, 1.6.4, 1.8.0
            Reporter: Till Rohrmann


At the moment, standby {{Dispatchers}} in per-job mode can restart a terminated 
job after they gain leadership. The problem is that we currently clear the 
{{RunningJobsRegistry}} once a job has reached a globally terminal state. After 
the leading {{Dispatcher}} terminates, a standby {{Dispatcher}} will gain 
leadership. Without the information from the {{RunningJobsRegistry}}, it cannot 
tell whether the job has already been executed or whether it needs to 
re-execute the job. At the moment, the {{Dispatcher}} will assume that there was 
a fault and hence re-execute the job. This can lead to duplicate results.
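
The sequence above can be illustrated with a minimal, in-memory sketch. This is 
not Flink's actual code: the class {{InMemoryRunningJobsRegistry}}, the helper 
{{shouldExecuteOnLeadership}}, the method names and the three status values are 
assumptions chosen to mirror the terms used in this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, in-memory model of the problem; NOT Flink's actual classes.
class StandbyDispatcherSketch {

    // Assumed status values, named after the JobSchedulingStatus in the summary.
    enum JobSchedulingStatus { PENDING, RUNNING, DONE }

    // Hypothetical stand-in for the shared (ZooKeeper-backed) registry.
    static final class InMemoryRunningJobsRegistry {
        private final Map<String, JobSchedulingStatus> statuses = new HashMap<>();

        void setJobRunning(String jobId) { statuses.put(jobId, JobSchedulingStatus.RUNNING); }

        void setJobFinished(String jobId) { statuses.put(jobId, JobSchedulingStatus.DONE); }

        // Called once the job has reached a globally terminal state.
        void clearJob(String jobId) { statuses.remove(jobId); }

        // A missing entry is indistinguishable from a job that never started.
        JobSchedulingStatus getJobSchedulingStatus(String jobId) {
            return statuses.getOrDefault(jobId, JobSchedulingStatus.PENDING);
        }
    }

    // Hypothetical decision a standby Dispatcher makes when it gains leadership.
    static boolean shouldExecuteOnLeadership(InMemoryRunningJobsRegistry registry, String jobId) {
        return registry.getJobSchedulingStatus(jobId) != JobSchedulingStatus.DONE;
    }

    public static void main(String[] args) {
        InMemoryRunningJobsRegistry registry = new InMemoryRunningJobsRegistry();
        String jobId = "job-1";

        // The leading Dispatcher runs the job to completion.
        registry.setJobRunning(jobId);
        registry.setJobFinished(jobId);
        System.out.println("before cleanup, re-execute? "
                + shouldExecuteOnLeadership(registry, jobId)); // false

        // The leader clears the registry and terminates; a standby takes over.
        registry.clearJob(jobId);
        System.out.println("after cleanup, re-execute? "
                + shouldExecuteOnLeadership(registry, jobId)); // true
    }
}
```

In this model the standby sees the same status for a cleared job as for a job 
that never ran, which is why it decides to execute the job again.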

I think we need some way to tell standby {{Dispatchers}} that a certain job has 
been successfully executed. One trivial solution could be to not clean up the 
{{RunningJobsRegistry}}, but then we would clutter ZooKeeper.
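
The trade-off of the trivial solution can be sketched in the same simplified 
style. Again, this is only an illustration under assumptions: 
{{RetainingRegistry}} and {{storedEntries}} are hypothetical names, and the map 
stands in for the entries that would remain in ZooKeeper.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the "do not clean up" option; NOT Flink's actual code.
class RetainingRegistrySketch {

    // Assumed status values, named after the JobSchedulingStatus in the summary.
    enum JobSchedulingStatus { PENDING, RUNNING, DONE }

    // Hypothetical registry that never removes entries for finished jobs.
    static final class RetainingRegistry {
        private final Map<String, JobSchedulingStatus> statuses = new HashMap<>();

        void setJobFinished(String jobId) { statuses.put(jobId, JobSchedulingStatus.DONE); }

        JobSchedulingStatus getJobSchedulingStatus(String jobId) {
            return statuses.getOrDefault(jobId, JobSchedulingStatus.PENDING);
        }

        // Number of entries still stored (stand-in for nodes left in ZooKeeper).
        int storedEntries() { return statuses.size(); }
    }

    public static void main(String[] args) {
        RetainingRegistry registry = new RetainingRegistry();
        for (int i = 0; i < 1000; i++) {
            registry.setJobFinished("job-" + i);
        }
        // A standby can still see that an old job is done ...
        System.out.println(registry.getJobSchedulingStatus("job-0")); // DONE
        // ... but one entry per finished job is kept forever.
        System.out.println(registry.storedEntries()); // 1000
    }
}
```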



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
