Eron Wright  created FLINK-8541:
-----------------------------------

             Summary: Mesos RM should recover from failover timeout
                 Key: FLINK-8541
                 URL: https://issues.apache.org/jira/browse/FLINK-8541
             Project: Flink
          Issue Type: Bug
          Components: Cluster Management, Mesos
    Affects Versions: 1.3.0
            Reporter: Eron Wright 
            Assignee: Eron Wright 


When a framework disconnects unexpectedly from Mesos, the framework's Mesos 
tasks continue to run for a configurable period of time known as the failover 
timeout. If the framework tries to reconnect after the timeout has expired, 
Mesos rejects the connection attempt. The framework is then expected to 
discard its previous framework ID and connect as a new framework.

When Flink is in this situation, the only recourse is to manually delete the 
ZooKeeper (ZK) state where the framework ID is kept. Let's improve the logic of 
the Mesos RM to automate that.
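
A rough sketch of the recovery being proposed, to make the intended behavior 
concrete. Every type and method name below (FrameworkIdStore, MesosMaster, 
FrameworkRemovedException, connect) is hypothetical and stands in for the real 
ZK-backed store and Mesos scheduler driver; this is not Flink's or Mesos's 
actual API, only one way the logic might look.

```java
import java.util.Optional;

/** Sketch only: all types here are hypothetical stand-ins, not Flink or Mesos APIs. */
public class FrameworkIdRecovery {

    /** Stand-in for the ZK-backed state that holds the framework ID. */
    public interface FrameworkIdStore {
        Optional<String> get();
        void set(String frameworkId);
        void clear();
    }

    /** Stand-in for the connection to the Mesos master. */
    public interface MesosMaster {
        /**
         * Registers the framework. An empty ID requests a brand-new framework.
         * Returns the framework ID that was accepted or newly assigned.
         */
        String register(Optional<String> frameworkId) throws FrameworkRemovedException;
    }

    /** Assumed signal that Mesos rejected an ID whose failover timeout expired. */
    public static class FrameworkRemovedException extends Exception {
        public FrameworkRemovedException(String message) {
            super(message);
        }
    }

    /** Trivial in-memory store, used here in place of ZooKeeper. */
    public static class InMemoryStore implements FrameworkIdStore {
        private Optional<String> id = Optional.empty();
        public Optional<String> get() { return id; }
        public void set(String frameworkId) { id = Optional.of(frameworkId); }
        public void clear() { id = Optional.empty(); }
    }

    /**
     * Connects with the stored framework ID if there is one. If Mesos rejects
     * that ID, the stored ID is discarded and the framework registers as new.
     */
    public static String connect(FrameworkIdStore store, MesosMaster master)
            throws FrameworkRemovedException {
        Optional<String> previous = store.get();
        String accepted;
        try {
            accepted = master.register(previous);
        } catch (FrameworkRemovedException e) {
            if (!previous.isPresent()) {
                // A fresh registration was rejected; there is no stale ID to drop.
                throw e;
            }
            // Automates the manual step: delete the stale ID, then start over.
            store.clear();
            accepted = master.register(Optional.empty());
        }
        store.set(accepted);
        return accepted;
    }
}
```

With a store that holds an expired ID, connect() is expected to clear it and 
persist whatever ID the second registration returns. A stored ID that Mesos 
still accepts is left in place. Tasks started under the old framework ID are 
not addressed by this sketch.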



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
