Thanks for informing us. As far as I know, we were not aware of any deadlock in the JobManager election. Let's hope that the updated Curator version fixed the problem. We will defenitely keep an eye on this. Feel free to contact the dev@ mailing list, if the problem still exists in 1.3.2.

Regards,
Timo

Am 06.09.17 um 22:58 schrieb James Bucher:
It seems I forgot to attach the full logs. They should be attached now.

From: James Bucher <jbuc...@expedia.com <mailto:jbuc...@expedia.com>>
Date: Wednesday, September 6, 2017 at 10:43 AM
To: "user@flink.apache.org <mailto:user@flink.apache.org>" <user@flink.apache.org <mailto:user@flink.apache.org>>
Subject: Flink 1.2.1 JobManager Election Deadlock

    Hey all,

    Just wanted to report this for posterity in case someone else sees
    something similar. We were running Flink 1.2.1 in Kubernetes. We
    use an HA setup with an external Zookeeper and S3 for
    checkpointing. We recently noticed a job that appears to have
    deadlocked on JobManager Leader election. We think the issue
    happened something like the following:

     1. Job was up and running normally
     2. Some cluster event caused the JobManager Pod (process) to get
        restarted.
     3. JobManager came up again but got stuck on LeaderElection. At
        this time the JobManager UI sent back a response with "Service
        temporarily unavailable due to an ongoing leader election.
        Please refresh."
     4. JobManager never exited this state.
     5. This state persisted across Pod restarts/deletes.

    In order to try to pin down this problem we brought up the Job in
    another Flink Cluster and started debugging this issue. As a first
    step I upped the logging level on the JobManager and applied the
    change. This resulted in the following log (Full logs attached to
    this email):

    2017-09-05 18:50:55,072 TRACE
    org.apache.flink.shaded.org.apache.curator.utils.DefaultTracerDriver
    - Trace: GetDataBuilderImpl-Background - 4 ms

    2017-09-05 18:50:55,075 DEBUG
    org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore
    - Received CHILD_ADDED event (path: /4e4cdc8fb8c1437c620cd4063bd265e1)

    2017-09-05 18:50:55,088 DEBUG
    org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore
    - Received CHILD_ADDED event notification for job
    4e4cdc8fb8c1437c620cd4063bd265e1

    2017-09-05 18:50:56,072 DEBUG org.apache.zookeeper.ClientCnxn    
          - Reading reply sessionid:0x15e3df8b5a57b4b, packet::
    clientPath:null serverPath:null finished:false header:: 1,3 
    replyHeader:: 1,21476396435,0 request::
    '/traveler-profile.us-west-2c.test.expedia.com,F response::
    s{4305881524,4305881524,1497456679255,1497456679255,0,3,0,0,0,3,21475572278}

    2017-09-05 18:50:56,078 DEBUG org.apache.zookeeper.ClientCnxn    
          - Reading reply sessionid:0x15e3df8b5a57b4b, packet::
    clientPath:null serverPath:null finished:false header:: 2,3 
    replyHeader:: 2,21476396435,0 request::
    '/traveler-profile.us-west-2c.test.expedia.com/krazyglue,F
    response::
    s{4305881525,4305881525,1497456679260,1497456679260,0,1,0,0,0,1,4305881526}

    2017-09-05 18:50:56,079 DEBUG org.apache.zookeeper.ClientCnxn    
          - Reading reply sessionid:0x15e3df8b5a57b4b, packet::
    clientPath:null serverPath:null finished:false header:: 3,3 
    replyHeader:: 3,21476396435,0 request::
    '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip,F
    response::
    s{4305881526,4305881526,1497456679263,1497456679263,0,1,0,0,0,1,4305881527}

    2017-09-05 18:50:56,080 DEBUG org.apache.zookeeper.ClientCnxn    
          - Reading reply sessionid:0x15e3df8b5a57b4b, packet::
    clientPath:null serverPath:null finished:false header:: 4,3 
    replyHeader:: 4,21476396435,0 request::
    '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink,F
    response::
    s{4305881527,4305881527,1497456679267,1497456679267,0,1,0,0,0,1,4305881528}

    2017-09-05 18:50:56,081 DEBUG org.apache.zookeeper.ClientCnxn    
          - Reading reply sessionid:0x15e3df8b5a57b4b, packet::
    clientPath:null serverPath:null finished:false header:: 5,3 
    replyHeader:: 5,21476396435,0 request::
    '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default,F
    response::
    
s{4305881528,4305881528,1497456679270,1497456679270,0,27,0,0,0,5,21475449005}

    2017-09-05 18:50:56,085 INFO
    org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
    - Starting ZooKeeperLeaderRetrievalService.

    2017-09-05 18:50:56,087 DEBUG org.apache.zookeeper.ClientCnxn    
          - Reading reply sessionid:0x15e3df8b5a57b4b, packet::
    clientPath:null serverPath:null finished:false header:: 6,3 
    replyHeader:: 6,21476396435,0 request::
    '/traveler-profile.us-west-2c.test.expedia.com,F response::
    s{4305881524,4305881524,1497456679255,1497456679255,0,3,0,0,0,3,21475572278}

    2017-09-05 18:50:56,087 DEBUG org.apache.zookeeper.ClientCnxn    
          - Reading reply sessionid:0x15e3df8b5a57b4b, packet::
    clientPath:null serverPath:null finished:false header:: 7,3 
    replyHeader:: 7,21476396435,0 request::
    '/traveler-profile.us-west-2c.test.expedia.com/krazyglue,F
    response::
    s{4305881525,4305881525,1497456679260,1497456679260,0,1,0,0,0,1,4305881526}

    2017-09-05 18:50:56,088 DEBUG org.apache.zookeeper.ClientCnxn    
          - Reading reply sessionid:0x15e3df8b5a57b4b, packet::
    clientPath:null serverPath:null finished:false header:: 8,3 
    replyHeader:: 8,21476396435,0 request::
    '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip,F
    response::
    s{4305881526,4305881526,1497456679263,1497456679263,0,1,0,0,0,1,4305881527}

    2017-09-05 18:50:56,090 DEBUG org.apache.zookeeper.ClientCnxn    
          - Reading reply sessionid:0x15e3df8b5a57b4b, packet::
    clientPath:null serverPath:null finished:false header:: 9,3 
    replyHeader:: 9,21476396435,0 request::
    '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink,F
    response::
    s{4305881527,4305881527,1497456679267,1497456679267,0,1,0,0,0,1,4305881528}

    2017-09-05 18:50:56,091 DEBUG org.apache.zookeeper.ClientCnxn    
          - Reading reply sessionid:0x15e3df8b5a57b4b, packet::
    clientPath:null serverPath:null finished:false header:: 10,3 
    replyHeader:: 10,21476396435,0 request::
    '/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default,F
    response::
    
s{4305881528,4305881528,1497456679270,1497456679270,0,27,0,0,0,5,21475449005}

    2017-09-05 18:50:56,092 DEBUG org.apache.zookeeper.ClientCnxn    
          - Reading reply sessionid:0x15e3df8b5a57b4b, packet::
    
clientPath:/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader
    
serverPath:/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader
    finished:false header:: 11,3  replyHeader:: 11,21476396435,0 
    request::
    
'/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader,T
    response::
    
s{21475449005,21475449005,1503519415753,1503519415753,0,1,0,0,0,1,21475449006}

    2017-09-05 18:50:56,092 TRACE
    org.apache.flink.shaded.org.apache.curator.utils.DefaultTracerDriver
    - Trace: ExistsBuilderImpl-Background - 1 ms

    2017-09-05 18:50:56,093 DEBUG org.apache.zookeeper.ClientCnxn    
          - Reading reply sessionid:0x15e3df8b5a57b4b, packet::
    
clientPath:/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader
    
serverPath:/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader
    finished:false header:: 12,4  replyHeader:: 12,21476396435,0 
    request::
    
'/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/default/leader,T
    response::
    
,s{21475449005,21475449005,1503519415753,1503519415753,0,1,0,0,0,1,21475449006}

    2017-09-05 18:50:56,093 TRACE
    org.apache.flink.shaded.org.apache.curator.utils.DefaultTracerDriver
    - Trace: GetDataBuilderImpl-Background - 1 ms

    2017-09-05 18:50:56,093 DEBUG
    org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
    - Leader node has changed.


    At this point I dug into the logs to try and see what was going
    on. It quickly became apparent that this probably had something to
    do with Apache Curator which Flink relies upon to do the leader
    election. At this point I grabbed a dump of zookeeper using:
    https://github.com/ctapmex/zkTreeUtil:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <root
    path="/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink">
      <zknode name="default">
        <zknode name="leaderlatch">
          <zknode name="00000000000000000000000000000000">
            <zknode name="job_manager_lock"/>
          </zknode>
          <zknode ephemeral="true"
    name="_c_9cd3f122-65de-40e7-8e2b-aa34e6a85b83-latch-0000000020"/>
        </zknode>
        <zknode name="checkpoint-counter">
          <zknode name="4e4cdc8fb8c1437c620cd4063bd265e1"
    value="&#0;&#0;&#0;&#1;"/>
        </zknode>
        <zknode name="checkpoints">
          <zknode name="4e4cdc8fb8c1437c620cd4063bd265e1"/>
        </zknode>
        <zknode name="jobgraphs">
          <zknode name="4e4cdc8fb8c1437c620cd4063bd265e1"
    
value="��&#0;&#5;sr&#0;;org.apache.flink.runtime.state.RetrievableStreamStateHandle&#0;&#1;&#30;&#24;�U�+&#2;&#0;&#1;L&#0;&#24;wrappedStreamStateHandlet&#0;2Lorg/apache/flink/runtime/state/StreamStateHandle;xpsr&#0;9org.apache.flink.runtime.state.filesystem.FileStateHandle&#4;�u�b�&#27;�&#2;&#0;&#2;J&#0;&#9;stateSizeL&#0;&#8;filePatht&#0;&#31;Lorg/apache/flink/core/fs/Path;xp&#0;&#0;&#0;&#0;&#0;&#15;h�sr&#0;&#29;org.apache.flink.core.fs.Path&#0;&#0;&#0;&#0;&#0;&#0;&#0;&#1;&#2;&#0;&#1;L&#0;&#3;urit&#0;&#14;Ljava/net/URI;xpsr&#0;&#12;java.net.URI�&#1;x.C�I�&#3;&#0;&#1;L&#0;&#6;stringt&#0;&#18;Ljava/lang/String;xpt&#0;�s3a://ewetest-traveler-profile/client-state/krazyglue/traveler-profile.us-west-2c.test.expedia.com/krazyglue/mip/flink/zookeeperCheckpoints/submittedJobGraph6507f982fe76x"/>
        </zknode>
        <zknode name="leader">
          <zknode name="00000000000000000000000000000000"/>
        </zknode>
      </zknode>
    </root>

    In a healthy cluster there is generally a node under
    “job_manager_lock” which represents the elected leader. This
    pointed to a problem with the Apache Curator framework. I opened
    up both Flink 1.2. And 1.3.2 to see what version of Curator Flink
    1.2.1 and 1.3.2 used. I noticed that Flink 1.3.2 uses Curator
    2.12.0 whereas Flink 1.2.1 uses 2.8.0. Looking at the release
    notes for curator there were quite a few bugs might be the cause
    of this (https://issues.apache.org/jira/browse/CURATOR-264 for
    example). I was curious to see if leader election would become
    unstuck if the cluster was upgraded to Flink 1.3.2. I upgraded the
    JobManager to 1.3.2 and sure enough leader election proceeded
    correctly. Upon Upgrading the TaskManagers the job restored from
    its last checkpoint correctly and continued.

    Other interesting things of note:

     1. Restarting/Killing the JobManager didn’t resolve the issue. I
        suspect this is due to bad state in zookeeper
     2. Zookeeper was running throughout this time. We didn’t see
        issues in our kafka cluster and Zookeeper has a leader elected
        correctly and the Zookeeper command line utility worked
        without issue.
     3. Restarts of the JobManager appeared to correctly create
        ephemeral nodes in zookeeper to represent possible leaders
        (which also appear to get cleaned up correctly).

    As far as I can tell this issue is resolved with the newer Apache
    Curator version in 1.3.2. It may be useful to keep an eye out for
    this issue however.

    Thanks,
    James Bucher


Reply via email to