[ 
https://issues.apache.org/jira/browse/HDFS-8161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521718#comment-14521718
 ] 

Chris Nauroth commented on HDFS-8161:
-------------------------------------

[~brahmareddy], this is excellent debugging.  Thank you for posting the 
information!

I have commented on ZOOKEEPER-2175.  They suspect their new wire encryption 
feature would catch a packet corruption issue like this.  I've also proposed a 
ZooKeeper feature for checksum validation even when wire encryption is not 
used.  (I think we'd have little motivation to use wire encryption for the HDFS 
HA use case, since the data we store in the znode isn't a secret.)

Meanwhile, I'm wondering if there is something we can change in Hadoop code to 
make ourselves more resilient to this.  The HDFS logic in this area is driven 
by ZooKeeper status code checks like the following in {{ActiveStandbyElector}}:

{code}
  private static boolean isSuccess(Code code) {
    return (code == Code.OK);
  }
{code}

I'm wondering if we can check for a specific ZooKeeper client status code, and 
then reconnect our session and retry taking the lock instead of transitioning 
to standby.  Do you know if there was a particular ZooKeeper status code that 
you saw when this happened?  Do you have the capability to repro consistently?
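
A minimal sketch of that idea follows. It is not the current {{ActiveStandbyElector}} logic. The {{Code}} enum is a local stand-in for a few {{org.apache.zookeeper.KeeperException.Code}} values so that the example compiles on its own. {{Action}}, {{decide}}, {{RECONNECT_CODES}}, and the attempt limit are hypothetical names. Which status code (if any) a corrupted packet actually produces is still the open question, so {{MARSHALLINGERROR}} is only a placeholder.

```java
import java.util.EnumSet;
import java.util.Set;

final class ElectorRetrySketch {
  /** Local stand-in for a few org.apache.zookeeper.KeeperException.Code values. */
  enum Code { OK, CONNECTIONLOSS, OPERATIONTIMEOUT, MARSHALLINGERROR, SESSIONEXPIRED }

  /** Hypothetical outcomes after a ZooKeeper call returns to the elector. */
  enum Action { PROCEED, RECONNECT_AND_RETRY, BECOME_STANDBY }

  // ASSUMPTION: the code that signals a corrupted packet is not known yet.
  // MARSHALLINGERROR is a placeholder until the real code is identified.
  private static final Set<Code> RECONNECT_CODES = EnumSet.of(Code.MARSHALLINGERROR);

  // Same check as the isSuccess method quoted above.
  private static boolean isSuccess(Code code) {
    return (code == Code.OK);
  }

  /**
   * Decide what to do with a status code.
   * attempt is the number of reconnects already tried for this operation;
   * once maxAttempts is reached, fall back to becoming standby.
   */
  static Action decide(Code code, int attempt, int maxAttempts) {
    if (isSuccess(code)) {
      return Action.PROCEED;
    }
    if (RECONNECT_CODES.contains(code) && attempt < maxAttempts) {
      return Action.RECONNECT_AND_RETRY;
    }
    return Action.BECOME_STANDBY;
  }

  public static void main(String[] args) {
    System.out.println(decide(Code.OK, 0, 3));
    System.out.println(decide(Code.MARSHALLINGERROR, 0, 3));
    System.out.println(decide(Code.MARSHALLINGERROR, 3, 3));
  }
}
```

Bounding the number of reconnect attempts keeps a persistently failing node from retrying forever; after the limit it still falls back to standby as it does today.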


> Both Namenodes are in standby State
> -----------------------------------
>
>                 Key: HDFS-8161
>                 URL: https://issues.apache.org/jira/browse/HDFS-8161
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover
>    Affects Versions: 2.6.0
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>         Attachments: ACTIVEBreadcumb and StandbyElector.txt
>
>
> Suspected Scenario:
> ================
> Start a cluster with three nodes.
> Reboot the machine where ZKFC is not running. (The active node's ZKFC should 
> have its session open with this ZK.)
> Now the ZKFC (active NN's) session expires and it tries to re-establish a 
> connection with another ZK. By that time, the ZKFC (standby NN's) will try to 
> fence the old active, create the active breadcrumb, and make the SNN active.
> But it immediately goes back to standby state. (This is the part in doubt.)
> Hence both will be in standby state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
