[ 
https://issues.apache.org/jira/browse/FLINK-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783977#comment-16783977
 ] 

TisonKun commented on FLINK-11813:
----------------------------------

We should never keep the {{RunningJobsRegistry}} forever, since that obviously 
causes a resource leak.

The problem here is that if DispatcherA and DispatcherB take the same job, 
then after DispatcherA finishes it, DispatcherB may re-execute the job, 
depending on the order of execution. However, if one dispatcher finishes a job 
and a new per-job cluster is later launched with the same job, we can regard 
them as two different jobs.

Thus we reduce the problem to making sure that, when a job finishes, the 
dispatchers that are currently running but have not been granted leadership 
(i.e., standby dispatchers) notice it. Although a dispatcher that is not the 
leader should never write to ZooKeeper, it is allowed to register a watcher. 
Then, while the standby dispatcher is running, if it is notified that the 
children under the {{RunningJobsRegistry}} path have changed, it can check 
whether the execution of its corresponding job should be cancelled.
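
As a rough sketch of this idea (not Flink code), the following uses an 
in-memory stand-in for the children of the {{RunningJobsRegistry}} path. All 
class and method names here are hypothetical, and a real implementation would 
set a ZooKeeper child watch instead of a callback list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical in-memory stand-in for the children of the registry path. */
class InMemoryRunningJobsRegistry {
    private final Set<String> runningJobs = new HashSet<>();
    private final List<Consumer<Set<String>>> watchers = new ArrayList<>();

    /** Any dispatcher may watch; the watcher first receives the current children. */
    void watchChildren(Consumer<Set<String>> watcher) {
        watchers.add(watcher);
        watcher.accept(snapshot());
    }

    /** Assumed to be called by the leader only: mark the job as running. */
    void setJobRunning(String jobId) {
        if (runningJobs.add(jobId)) {
            notifyWatchers();
        }
    }

    /** Assumed to be called by the leader only: clean up once the job has finished. */
    void clearJob(String jobId) {
        if (runningJobs.remove(jobId)) {
            notifyWatchers();
        }
    }

    private Set<String> snapshot() {
        return Collections.unmodifiableSet(new HashSet<>(runningJobs));
    }

    private void notifyWatchers() {
        Set<String> children = snapshot();
        for (Consumer<Set<String>> watcher : new ArrayList<>(watchers)) {
            watcher.accept(children);
        }
    }
}

/** Hypothetical standby dispatcher: it only reads (watches) and never writes. */
class StandbyDispatcher {
    private final String jobId;
    private boolean sawRunning = false;
    private boolean executionCancelled = false;

    StandbyDispatcher(String jobId, InMemoryRunningJobsRegistry registry) {
        this.jobId = jobId;
        registry.watchChildren(this::onChildrenChanged);
    }

    private void onChildrenChanged(Set<String> children) {
        if (children.contains(jobId)) {
            sawRunning = true;
        } else if (sawRunning) {
            // The entry disappeared after we saw it: the job finished elsewhere.
            executionCancelled = true;
        }
    }

    /** Whether this dispatcher should still run the job if it gains leadership. */
    boolean shouldExecuteOnLeadership() {
        return !executionCancelled;
    }
}
```

With this sketch, a standby that saw the entry for its job appear and then 
disappear would skip re-execution when it later gains leadership, while a 
dispatcher in a freshly launched cluster never saw the entry and would run 
the job as a new one.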

Following this approach, as a side effect, we no longer need the {{DONE}} 
status, since it is implied by a transition from {{RUNNING}} to {{PENDING}} 
(or say, not running). This would simplify the logic; currently, as this issue 
points out, {{PENDING}} and {{DONE}} are ambiguous (since we clean up when the 
job finishes).
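
To illustrate why the transition is enough, here is a small sketch with a 
hypothetical two-state status enum and a tracker that remembers the last 
observed status. Neither type exists in Flink under these names; they only 
show that "finished" can be derived without a separate {{DONE}} value.

```java
/** Hypothetical two-state variant of the scheduling status, without DONE. */
enum SchedulingStatus { PENDING, RUNNING }

/** Remembers the last observed status to tell a finished job from a fresh one. */
class StatusTransitionTracker {
    private SchedulingStatus last = SchedulingStatus.PENDING;
    private boolean finished = false;

    /** Feed each observed status in order; RUNNING followed by PENDING means finished. */
    void observe(SchedulingStatus current) {
        if (last == SchedulingStatus.RUNNING && current == SchedulingStatus.PENDING) {
            finished = true;
        }
        last = current;
    }

    boolean isFinished() {
        return finished;
    }
}
```

A single {{PENDING}} reading stays ambiguous on its own; only an observer that 
was running while the status changed can resolve it, which is exactly the 
position a standby dispatcher is in.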

> Standby per job mode Dispatchers don't know job's JobSchedulingStatus
> ---------------------------------------------------------------------
>
>                 Key: FLINK-11813
>                 URL: https://issues.apache.org/jira/browse/FLINK-11813
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>            Reporter: Till Rohrmann
>            Priority: Major
>
> At the moment, it can happen that standby {{Dispatchers}} in per job mode 
> will restart a terminated job after they gained leadership. The problem is 
> that we currently clear the {{RunningJobsRegistry}} once a job has reached a 
> globally terminal state. After the leading {{Dispatcher}} terminates, a 
> standby {{Dispatcher}} will gain leadership. Without having the information 
> from the {{RunningJobsRegistry}} it cannot tell whether the job has been 
> executed or whether the {{Dispatcher}} needs to re-execute the job. At the 
> moment, the {{Dispatcher}} will assume that there was a fault and hence 
> re-execute the job. This can lead to duplicate results.
> I think we need some way to tell standby {{Dispatchers}} that a certain job 
> has been successfully executed. One trivial solution could be to not clean up 
> the {{RunningJobsRegistry}} but then we will clutter ZooKeeper.


