GitHub user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/2696

That's a very good addition, we need something like that.

After an offline discussion with @tillrohrmann we came to the following conclusion:

There is a tricky problem with the pure approach: when the JobManager fails, all TaskManagers will "quarantine" that JobManager's actor system once they detect the failure. That means they exit and restart. Effectively, a JobManager failure results in all TaskManagers restarting. That is a bit heavy.

Instead, we'll adjust this to do the following:

- TaskManagers must not watch the JobManager via Akka. That way, JobManager failures do not cause any quarantining on the TaskManager side.
- The JobManager keeps watching the TaskManagers via Akka, so TaskManager failures (including false positives) still result in TaskManager quarantine, which means the TaskManager needs to restart when a TM-JM link breaks.

How do TaskManagers then detect JobManager failure?

- TaskManagers send heartbeats to the JobManager anyway (accumulators, and in the future task status reconciliation). The TaskManagers use those to detect JobManager failures.
- In high-availability setups, TaskManagers also notice JobManager failure via ZooKeeper.
- In addition, in FLIP-6 the resource manager tells TaskManagers about JobManager container failures.
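
To make the heartbeat-based detection concrete, here is a minimal sketch of what the TaskManager side could look like. It is not Flink's actual implementation: the class name `JobManagerHeartbeatMonitor`, its methods, and the timeout parameter are all hypothetical, and it assumes the JobManager acknowledges each heartbeat so the TaskManager can record when it last heard back.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch: TaskManager-side detection of a JobManager failure,
 * based on the heartbeats the TaskManager sends anyway. No Akka death watch
 * is involved, so a JobManager failure does not trigger quarantining.
 */
final class JobManagerHeartbeatMonitor {

    private final long timeoutMillis;   // assumed configuration value
    private final LongSupplier clock;   // injectable time source (milliseconds)
    private long lastAckMillis;         // time of the last acknowledged heartbeat

    JobManagerHeartbeatMonitor(long timeoutMillis, LongSupplier clock) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
        this.lastAckMillis = clock.getAsLong();
    }

    /** Call whenever the JobManager acknowledges a heartbeat. */
    synchronized void onHeartbeatAcknowledged() {
        lastAckMillis = clock.getAsLong();
    }

    /** True once no acknowledgement has arrived for longer than the timeout. */
    synchronized boolean isJobManagerConsideredFailed() {
        return clock.getAsLong() - lastAckMillis > timeoutMillis;
    }
}
```

A TaskManager would check `isJobManagerConsideredFailed()` periodically, passing something like `System::currentTimeMillis` as the clock, and react to a failure without exiting its own process. The ZooKeeper and resource manager notifications mentioned above would be additional, independent signals.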