Github user StephanEwen commented on the issue:
https://github.com/apache/flink/pull/2696
That's a very good addition, we need something like that.
After an offline discussion with @tillrohrmann we came to the following
conclusion:
There is a tricky problem with that pure appraoch: When the JobManager
fails, all TaskManagers will "quarantine" that JobManager's actor system after
they detected the failure. That means they exit and restart. Effectively, a
JobManager failure results in all TaskManagers restarting.
That is a bit heavy.
Instead, we'll adjust this to do the following:
- TaskManagers must not watch the JobManager via Akka. That way,
JobManager failures do not cause any quarantining on the TaskManager side.
- The JobManager keeps watching the TaskManagers via Akka, so TaskManager
failures (false positives) still result in TaskManager quarantine, which means
the TaskManager need to restart when a TM-JM link breaks
How do TaskManagers then detect JobManager failure?
- TaskManagers send heartbeats to the JobManager anyways (accumulators,
in the future task status reconciliation). The TaskManagers use that to detect
JobManager failures.
- In high availability setups, TaskManagers notice JobManager failure
also via ZooKeeper
- In addition, in flip-6 the resource manager tells TaskManagers about
JobManager container failures
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---