[ https://issues.apache.org/jira/browse/HDFS-8161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521718#comment-14521718 ]
Chris Nauroth commented on HDFS-8161:
-------------------------------------

[~brahmareddy], this is excellent debugging. Thank you for posting the information!

I have commented on ZOOKEEPER-2175. They suspect their new wire encryption feature would catch a packet corruption issue like this. I've also proposed a ZooKeeper feature for checksum validation even when wire encryption is not used. (I think we'd have little motivation to use wire encryption for the HDFS HA use case, since the data we store in the znode isn't a secret.)

Meanwhile, I'm wondering if there is something we can change in Hadoop code to make ourselves more resilient to this. The HDFS logic in this area is driven by ZooKeeper status code checks like the following in {{ActiveStandbyElector}}:

{code}
private static boolean isSuccess(Code code) {
  return (code == Code.OK);
}
{code}

I'm wondering if we can check for a specific ZooKeeper client status code, and then reconnect our session and retry taking the lock instead of transitioning to standby. Do you know if there was a particular ZooKeeper status code that you saw when this happened? Do you have the capability to repro consistently?

> Both Namenodes are in standby State
> -----------------------------------
>
>                 Key: HDFS-8161
>                 URL: https://issues.apache.org/jira/browse/HDFS-8161
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover
>    Affects Versions: 2.6.0
>            Reporter: Brahma Reddy Battula
>            Assignee: Brahma Reddy Battula
>         Attachments: ACTIVEBreadcumb and StandbyElector.txt
>
>
> Suspected Scenario:
> ================
> Start cluster with three nodes.
> Reboot the machine where ZKFC is not running. (Here the active node's ZKFC should open a
> session with this ZK.)
> Now the ZKFC (active NN's) session expires and it tries to re-establish a connection with
> another ZK. By the time it does, the ZKFC (standby NN's) will try to fence the old active,
> create the active breadcrumb, and make the SNN active.
> But immediately it fences to standby state. (Here is the doubt.)
> Hence both will be in standby state.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
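
A minimal sketch of the retry idea raised in the comment above: classify the status code first, and only fall back to standby when the failure is not retriable or the retry budget is spent. The comment leaves open which status code was actually observed, so the choice of retriable codes here is an assumption. The `Code` enum is a local stand-in for a few values of ZooKeeper's `KeeperException.Code` so that the example compiles on its own; `Action`, `isRetriable`, `decide`, and `MAX_RETRIES` are hypothetical names and are not taken from `ActiveStandbyElector`.

```java
// Sketch only, not the ActiveStandbyElector implementation.
class ElectorRetrySketch {
    // Local stand-in for a subset of org.apache.zookeeper.KeeperException.Code.
    enum Code { OK, CONNECTIONLOSS, OPERATIONTIMEOUT, SESSIONEXPIRED, NODEEXISTS }

    // Hypothetical outcomes of one attempt to take the lock.
    enum Action { BECOME_ACTIVE, RECONNECT_AND_RETRY, BECOME_STANDBY }

    // Assumed retry budget; a real value would need tuning.
    static final int MAX_RETRIES = 3;

    // Same check as the snippet quoted in the comment.
    static boolean isSuccess(Code code) {
        return code == Code.OK;
    }

    // Assumption: these codes point at a transient client-side problem,
    // so reconnecting and retrying is safer than giving up the lock.
    static boolean isRetriable(Code code) {
        return code == Code.CONNECTIONLOSS || code == Code.OPERATIONTIMEOUT;
    }

    // Decide what to do after an attempt, given how many retries were already made.
    static Action decide(Code code, int retriesSoFar) {
        if (isSuccess(code)) {
            return Action.BECOME_ACTIVE;
        }
        if (isRetriable(code) && retriesSoFar < MAX_RETRIES) {
            return Action.RECONNECT_AND_RETRY;
        }
        return Action.BECOME_STANDBY;
    }

    public static void main(String[] args) {
        System.out.println(decide(Code.OK, 0));
        System.out.println(decide(Code.CONNECTIONLOSS, 0));
        System.out.println(decide(Code.CONNECTIONLOSS, MAX_RETRIES));
        System.out.println(decide(Code.NODEEXISTS, 0));
    }
}
```

Bounding the retries keeps a persistently failing node from looping forever: once the budget is exhausted it still transitions to standby, which matches the current behavior for every non-OK code.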