Till Rohrmann created FLINK-10052:
-------------------------------------

             Summary: Tolerate temporarily suspended ZooKeeper connections
                 Key: FLINK-10052
                 URL: https://issues.apache.org/jira/browse/FLINK-10052
             Project: Flink
          Issue Type: Improvement
          Components: Distributed Coordination
    Affects Versions: 1.5.2, 1.4.2, 1.6.0
            Reporter: Till Rohrmann


This issue results from FLINK-10011, which uncovered a problem with Flink's HA 
recovery and proposed the following solution to harden Flink:

The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator recipe 
for leader election. The leader latch revokes leadership as soon as the 
ZooKeeper connection is suspended. This can be premature if the system is able 
to reconnect to ZooKeeper before its session expires. The effect of the lost 
leadership is that all jobs are canceled and then immediately restarted once 
leadership is regained.

Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
connection, it would be better to wait until the ZooKeeper connection is LOST. 
That way, the system could reconnect without losing the leadership. 
This could be achieved by using Curator's {{LeaderSelector}} instead of the 
{{LeaderLatch}}.
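
The proposed behavior can be illustrated with a minimal, dependency-free sketch. It is not Flink's or Curator's implementation: {{ConnState}} and {{TolerantLeadership}} are hypothetical names introduced here only to model the connection states mentioned above (SUSPENDED, LOST) and the rule "keep leadership on SUSPENDED, revoke on LOST". A real change would presumably hook into Curator's connection state notifications instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the connection states named in this issue;
// this is NOT Curator's own type.
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

// Hypothetical model of a leader that tolerates a SUSPENDED connection
// and gives up leadership only once the connection is LOST.
final class TolerantLeadership {
    private boolean leader;
    private final List<String> events = new ArrayList<>();

    // Called when this contender wins the election.
    void grantLeadership() {
        leader = true;
        events.add("granted");
    }

    // Called on every connection state transition.
    void onConnectionStateChanged(ConnState newState) {
        switch (newState) {
            case SUSPENDED:
                // Connection interrupted, but the session may still be alive:
                // keep leadership and wait for RECONNECTED or LOST.
                break;
            case LOST:
                // The session is considered gone: leadership must be revoked.
                if (leader) {
                    leader = false;
                    events.add("revoked");
                }
                break;
            default:
                // CONNECTED / RECONNECTED: nothing to revoke.
                break;
        }
    }

    boolean isLeader() {
        return leader;
    }

    // Recorded leadership transitions, in order.
    List<String> events() {
        return events;
    }
}
```

In this model, a SUSPENDED followed by RECONNECTED leaves the leadership untouched, so no jobs would need to be canceled and restarted; only a LOST transition triggers the revocation path.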



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
