Eron Wright created FLINK-8541:
-----------------------------------

             Summary: Mesos RM should recover from failover timeout
                 Key: FLINK-8541
                 URL: https://issues.apache.org/jira/browse/FLINK-8541
             Project: Flink
          Issue Type: Bug
          Components: Cluster Management, Mesos
    Affects Versions: 1.3.0
            Reporter: Eron Wright
            Assignee: Eron Wright

When a framework disconnects unexpectedly from Mesos, the framework's Mesos tasks continue to run for a configurable period of time known as the failover timeout. If the framework reconnects to Mesos after the timeout has expired, Mesos rejects the connection attempt. The framework is then expected to discard the previous framework ID and connect as a new framework.

When Flink is in this situation, the only recourse is to manually delete the ZooKeeper (ZK) state where the framework ID is kept. Let's improve the logic of the Mesos RM to automate that.
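
A rough sketch of the recovery logic described above, to make the intended behavior concrete. This is not the actual Mesos RM code: `FrameworkIdStore`, `MesosConnection`, `FrameworkRejectedException`, and `connect` are hypothetical names invented for illustration. The in-memory store stands in for the ZK state. How the Mesos scheduler driver actually surfaces the rejection is not modeled here.

```java
import java.util.Optional;

/** Illustrative only; none of these types are Flink or Mesos APIs. */
final class FailoverRecoverySketch {

    /** Hypothetical stand-in for the ZK-backed state that keeps the framework ID. */
    static class FrameworkIdStore {
        private String frameworkId; // null means no ID is stored

        Optional<String> get() { return Optional.ofNullable(frameworkId); }
        void set(String id) { frameworkId = id; }
        void clear() { frameworkId = null; }
    }

    /** Hypothetical signal that Mesos rejected the registration attempt. */
    static class FrameworkRejectedException extends Exception {
        FrameworkRejectedException(String message) { super(message); }
    }

    /** Hypothetical stand-in for the connection to the Mesos master. */
    interface MesosConnection {
        /** Registers with the given ID (empty means "new framework") and returns the assigned ID. */
        String register(Optional<String> frameworkId) throws FrameworkRejectedException;
    }

    /**
     * Tries to reconnect with the stored framework ID. If Mesos rejects it,
     * discards the stored ID and registers as a new framework.
     */
    static String connect(FrameworkIdStore store, MesosConnection mesos)
            throws FrameworkRejectedException {
        Optional<String> previous = store.get();
        try {
            String id = mesos.register(previous);
            store.set(id);
            return id;
        } catch (FrameworkRejectedException e) {
            if (!previous.isPresent()) {
                throw e; // a fresh registration was rejected; there is no stale ID to discard
            }
            store.clear(); // automates the manual deletion of the stored framework ID
            String id = mesos.register(Optional.empty());
            store.set(id);
            return id;
        }
    }

    public static void main(String[] args) throws Exception {
        FrameworkIdStore store = new FrameworkIdStore();
        store.set("old-framework");

        // Simulated master: rejects any previously known ID, accepts new frameworks.
        MesosConnection mesos = requested -> {
            if (requested.isPresent()) {
                throw new FrameworkRejectedException("failover timeout expired");
            }
            return "new-framework";
        };

        System.out.println(connect(store, mesos));
    }
}
```

A real implementation would need to tell a failover-timeout rejection apart from other registration errors before discarding the ID, since the sketch treats every rejection of a stored ID the same way.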