Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/2696
  
    That's a very good addition, we need something like that.
    
    After an offline discussion with @tillrohrmann we came to the following 
conclusion:
    
    There is a tricky problem with that pure appraoch: When the JobManager 
fails, all TaskManagers will "quarantine" that JobManager's actor system after 
they detected the failure. That means they exit and restart. Effectively, a 
JobManager failure results in all TaskManagers restarting.
    That is a bit heavy.
    
    Instead, we'll adjust this to do the following:
      - TaskManagers must not watch the JobManager via Akka. That way, 
JobManager failures do not cause any quarantining on the TaskManager side.
      - The JobManager keeps watching the TaskManagers via Akka, so TaskManager 
failures (false positives) still result in TaskManager quarantine, which means 
the TaskManager need to restart when a TM-JM link breaks
    
    How do TaskManagers then detect JobManager failure?
      - TaskManagers send heartbeats to the JobManager anyways (accumulators, 
in the future task status reconciliation). The TaskManagers use that to detect 
JobManager failures.
      - In high availability setups, TaskManagers notice JobManager failure 
also via ZooKeeper
      - In addition, in flip-6 the resource manager tells TaskManagers about 
JobManager container failures



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

Reply via email to