[ https://issues.apache.org/jira/browse/FLINK-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611247#comment-15611247 ]
Zhijiang Wang commented on FLINK-4911: -------------------------------------- Yeah, it does "at-least once". BTW, I confirmed with Till that the heartbeat manager module has finished and the previous blocked jiras can be resumed. Some are assigned to me before related with heartbeat interaction between TM, JM and RM. I will work on them in the following days and then consider the JM failure issue, and I think the payload informations reported in the heartbeat messages maybe reused in JM failure scenario. > Non-disruptive JobManager Failures via Reconciliation > ------------------------------------------------------ > > Key: FLINK-4911 > URL: https://issues.apache.org/jira/browse/FLINK-4911 > Project: Flink > Issue Type: New Feature > Components: JobManager, TaskManager > Reporter: Stephan Ewen > Assignee: Zhijiang Wang > > JobManager failures can be handled in a non-disruptive way - by *reconciling* > the new JobManager leader and the TaskManagers. > I suggest to use this term (reconcile) - it has been uses also by other > frameworks (like Mesos) for non-disruptive handling of failures. > The basic approach is the following: > - When a JobManager fails, TaskManagers do not cancel tasks, but attempt to > reconnect to the JobManager > - On connect, the TaskManager tells the JobManager about its currently > running tasks > - A new JobManager waits for TaskManagers to connect and report a task > status. It re-constructs the ExecutionGraph state from these reports > - Tasks whose status was not reconstructed in a certain time are assumed > failed and trigger regular task recovery. > To avoid having to re-implement this for the new JobManager / TaskManager > approach in *flip-6*, I suggest to directly implement this into the > {{flip-6}} feature branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)