[ https://issues.apache.org/jira/browse/HADOOP-9459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Todd Lipcon updated HADOOP-9459:
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.5-beta
                   3.0.0
     Hadoop Flags: Reviewed
           Status: Resolved  (was: Patch Available)

Committed to trunk and branch-2. Thanks for tracking down this tricky bug, Vinay.

> ActiveStandbyElector can join election even before Service HEALTHY, and results in null data at ActiveBreadCrumb
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-9459
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9459
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>   Affects Versions: 2.0.2-alpha
>            Reporter: Vinay
>            Assignee: Vinay
>            Priority: Critical
>             Fix For: 3.0.0, 2.0.5-beta
>
>         Attachments: HDFS-4463.patch, hdfs-4463.txt
>
>
> ActiveStandbyElector can store null at ActiveBreadCrumb in the race condition below. After that, all failovers fail with an NPE.
> 1. ZKFC is restarted.
> 2. Because the machine is busy, the first ZK connection expires before the health monitor has returned the status.
> 3. On re-establishment, transitionToActive is called; at this time appData is still null.
> 4. So ActiveBreadCrumb now holds null.
> 5. After this, any failover fails, throwing
> {noformat}java.lang.NullPointerException
>         at org.apache.hadoop.util.StringUtils.byteToHexString(StringUtils.java:171)
>         at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:892)
>         at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:797)
>         at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:475)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:545)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497){noformat}
> The elector should not join the election before the service is HEALTHY.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
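
The race described in the report can be illustrated with a small Java sketch. This is only an illustration under stated assumptions, not the committed patch: ElectorSketch, its fields, and its methods are hypothetical stand-ins, ZooKeeper is not involved, and "winning the election" is reduced to copying appData into a field that plays the role of the ActiveBreadCrumb data. The point it shows is the guard: a rejoin after session expiry is skipped while appData is still null, so null never reaches the breadcrumb.

```java
// Hypothetical stand-in for the elector; NOT the real ActiveStandbyElector.
class ElectorSketch {
    // Set only once the service has been reported HEALTHY (assumption of this sketch).
    private byte[] appData;
    // Plays the role of the data stored at ActiveBreadCrumb.
    private byte[] breadcrumb;

    // Normal path: called once the health monitor reports HEALTHY.
    void joinElection(byte[] data) {
        if (data == null) {
            throw new IllegalArgumentException("data cannot be null");
        }
        appData = data.clone();
        becomeActive();
    }

    // Recovery path: called after a session expiry and reconnect.
    // Returns true if the election was rejoined, false if it was skipped.
    boolean reJoinElection() {
        if (appData == null) {
            // Service has not been reported healthy yet: stay out of the
            // election instead of writing null into the breadcrumb.
            return false;
        }
        becomeActive();
        return true;
    }

    // Simplification: joining always "wins" and records the breadcrumb.
    private void becomeActive() {
        breadcrumb = appData.clone();
    }

    // Returns a copy of the breadcrumb data, or null if none was written.
    byte[] getBreadcrumb() {
        return breadcrumb == null ? null : breadcrumb.clone();
    }
}
```

Without the null check in reJoinElection, step 3 of the report would reach becomeActive with appData unset, which corresponds to the null breadcrumb that later breaks fencing.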