[ 
https://issues.apache.org/jira/browse/FLINK-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308824#comment-17308824
 ] 

Till Rohrmann commented on FLINK-11813:
---------------------------------------

The more I think about this problem, the more I am convinced that the 
{{RunningJobsRegistry}} must be able to outlive a concrete {{Dispatcher}} in 
order to solve the standby JobManager problem. Only then is it possible to rely 
on the registry for filtering out job submissions/restarts for jobs which have 
actually been completed. This would then imply that the deployer of the Flink 
cluster is responsible for cleaning this registry once it has received the job 
result and shut down the cluster.

If the deployer/user/owner of the cluster does not do the cleanup, then this 
would lead to orphaned entries in ZooKeeper or K8s, for example. In order to 
avoid this, one could make the usage of the {{RunningJobsRegistry}} optional, 
which means that one needs to activate it explicitly.
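
To make the proposal concrete, here is a minimal sketch of the idea, not 
Flink's actual code. All names ({{PersistentJobRegistry}}, {{SketchDispatcher}}, 
{{JobStatus}}, {{onLeadershipGranted}}) are hypothetical, and the in-memory map 
merely stands in for storage such as ZooKeeper or K8s that outlives any single 
{{Dispatcher}}. Passing {{null}} as the registry models the "not activated" 
case.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only; these types are NOT Flink's API.
enum JobStatus { PENDING, RUNNING, DONE }

/** Stand-in for a registry whose backing store outlives a concrete Dispatcher. */
final class PersistentJobRegistry {
    private final Map<String, JobStatus> entries = new ConcurrentHashMap<>();

    JobStatus status(String jobId) {
        // Unknown jobs are treated as never started.
        return entries.getOrDefault(jobId, JobStatus.PENDING);
    }

    void markRunning(String jobId) { entries.put(jobId, JobStatus.RUNNING); }

    void markDone(String jobId) { entries.put(jobId, JobStatus.DONE); }

    /** Meant to be called by the deployer after it has received the job result. */
    void clear(String jobId) { entries.remove(jobId); }
}

/** Simplified per-job Dispatcher that consults the registry on gaining leadership. */
final class SketchDispatcher {
    private final PersistentJobRegistry registry; // null means the registry is not activated

    SketchDispatcher(PersistentJobRegistry registry) {
        this.registry = registry;
    }

    /** Returns true if the job was executed, false if it was skipped as already done. */
    boolean onLeadershipGranted(String jobId) {
        if (registry != null && registry.status(jobId) == JobStatus.DONE) {
            return false; // a previous leader already completed the job
        }
        if (registry != null) {
            registry.markRunning(jobId);
        }
        // ... the job would run here ...
        if (registry != null) {
            registry.markDone(jobId); // entry is kept, not cleared, on termination
        }
        return true;
    }
}
```

With a shared registry, a second {{SketchDispatcher}} that gains leadership 
for the same job id skips execution until the deployer calls {{clear}}. 
Without a registry, every new leader re-executes the job, which is the 
behaviour described in the issue.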

> Standby per job mode Dispatchers don't know job's JobSchedulingStatus
> ---------------------------------------------------------------------
>
>                 Key: FLINK-11813
>                 URL: https://issues.apache.org/jira/browse/FLINK-11813
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>            Reporter: Till Rohrmann
>            Priority: Major
>
> At the moment, it can happen that standby {{Dispatchers}} in per job mode 
> will restart a terminated job after they gained leadership. The problem is 
> that we currently clear the {{RunningJobsRegistry}} once a job has reached a 
> globally terminal state. After the leading {{Dispatcher}} terminates, a 
> standby {{Dispatcher}} will gain leadership. Without having the information 
> from the {{RunningJobsRegistry}} it cannot tell whether the job has been 
> executed or whether the {{Dispatcher}} needs to re-execute the job. At the 
> moment, the {{Dispatcher}} will assume that there was a fault and hence 
> re-execute the job. This can lead to duplicate results.
> I think we need some way to tell standby {{Dispatchers}} that a certain job 
> has been successfully executed. One trivial solution could be to not clean up 
> the {{RunningJobsRegistry}} but then we will clutter ZooKeeper.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
