lamber-ken created FLINK-13189:
----------------------------------

             Summary: Fix the impact of zookeeper network disconnect 
temporarily on flink long running jobs
                 Key: FLINK-13189
                 URL: https://issues.apache.org/jira/browse/FLINK-13189
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Network
    Affects Versions: 1.8.1
            Reporter: lamber-ken
            Assignee: lamber-ken
             Fix For: 1.9.0


*Issue detail info*

We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use 
ZooKeeper as the HighAvailabilityService, but we found that a Flink job 
restarts whenever the network between the JobManager and ZooKeeper is 
temporarily disconnected.

So we analyzed this problem in depth. The Flink JobManager uses Curator's 
`LeaderLatch` to maintain leadership. When the network disconnects, 
`LeaderLatch` immediately sets leadership to false. We think this is too 
aggressive: many long-running Flink jobs restart because of a brief network 
glitch.
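
To make the difference concrete, here is a minimal sketch of the two behaviours. It is a simplified model and not Curator's actual `LeaderLatch` code: the names `ConnState`, `LossPolicy` and `LatchModel` are hypothetical, and the model assumes a single contender that starts as leader and is re-elected immediately on reconnect.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model only; this is NOT Curator's LeaderLatch code.
// ConnState, LossPolicy and LatchModel are hypothetical names for illustration.
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

enum LossPolicy {
    ON_SUSPENDED,    // revoke leadership as soon as the connection is suspended
    ON_SESSION_LOST  // tolerate SUSPENDED; revoke only when the session is LOST
}

final class LatchModel {
    private final LossPolicy policy;
    private boolean leader = true;   // assumption: this contender starts as leader
    private int losses = 0;          // each loss would trigger a job restart

    LatchModel(LossPolicy policy) {
        this.policy = policy;
    }

    void onStateChange(ConnState state) {
        boolean revoke = state == ConnState.LOST
                || (state == ConnState.SUSPENDED && policy == LossPolicy.ON_SUSPENDED);
        if (revoke) {
            if (leader) {
                leader = false;
                losses++;
            }
        } else if (state == ConnState.RECONNECTED && !leader) {
            // Simplification: the only contender wins the re-election at once.
            leader = true;
        }
    }

    boolean isLeader() { return leader; }

    int losses() { return losses; }

    // Replays a sequence of connection events and counts leadership losses.
    static int countLosses(LossPolicy policy, List<ConnState> events) {
        LatchModel latch = new LatchModel(policy);
        for (ConnState e : events) {
            latch.onStateChange(e);
        }
        return latch.losses();
    }
}

class LeadershipDemo {
    public static void main(String[] args) {
        // A short network blip: the connection is suspended, then comes back.
        List<ConnState> blip = Arrays.asList(ConnState.SUSPENDED, ConnState.RECONNECTED);
        for (LossPolicy p : LossPolicy.values()) {
            System.out.println(p + " -> leadership losses: " + LatchModel.countLosses(p, blip));
        }
    }
}
```

In this model a short blip costs one leadership loss under `ON_SUSPENDED` (the behaviour we observe) and none under `ON_SESSION_LOST`. Both policies still give up leadership once the session is actually lost.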

 

*Fix this issue*

According to the official Curator website, this issue was fixed in 
Curator 3.x.x, but we cannot simply change the Curator version used by Flink 
(2.12.0) to 3.x.x because of ZooKeeper compatibility: Curator 2.x.x supports 
ZooKeeper 3.4.x and ZooKeeper 3.5.0, while Curator 3.x.x is compatible only 
with ZooKeeper 3.5.x. For these reasons, we update `LeaderLatch` in the 
flink-shaded-curator module.

 

*Useful links*

[https://curator.apache.org/zk-compatibility.html] 
[https://cwiki.apache.org/confluence/display/CURATOR/Releases] 
[http://curator.apache.org/curator-recipes/leader-latch.html]

  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
