ramkrishna.s.vasudevan created HBASE-6122: ---------------------------------------------
Summary: Backup master does not become Active master after ZK exception
Key: HBASE-6122
URL: https://issues.apache.org/jira/browse/HBASE-6122
Project: HBase
Issue Type: Bug
Affects Versions: 0.94.0
Reporter: ramkrishna.s.vasudevan
Fix For: 0.96.0, 0.94.1

-> Active master gets ZK expiry exception.
-> Backup master becomes active.
-> The previous active master retries and becomes the backup master.

Now when the new active master goes down and the current backup master tries to take over, it goes down again because of the ZK expiry exception it got in the first step.

{code}
if (abortNow(msg, t)) {
  if (t != null) LOG.fatal(msg, t);
  else LOG.fatal(msg);
  this.abort = true;
  stop("Aborting");
}
{code}

In ActiveMasterManager.blockUntilBecomingActiveMaster we wait until the backup master can become active.

{code}
synchronized (this.clusterHasActiveMaster) {
  while (this.clusterHasActiveMaster.get() && !this.master.isStopped()) {
    try {
      this.clusterHasActiveMaster.wait();
    } catch (InterruptedException e) {
      // We expect to be interrupted when a master dies, will fall out if so
      LOG.debug("Interrupted waiting for master to die", e);
    }
  }
  if (!clusterStatusTracker.isClusterUp()) {
    this.master.stop("Cluster went down before this master became active");
  }
  if (this.master.isStopped()) {
    return cleanSetOfActiveMaster;
  }
  // Try to become active master again now that there is no active master
  blockUntilBecomingActiveMaster(startupStatus,clusterStatusTracker);
}
return cleanSetOfActiveMaster;
{code}

When the backup master (which is in backup mode because it got the ZK exception) once again tries to become active, we do not use the return value of the recursive call:

{code}
// Try to become active master again now that there is no active master
blockUntilBecomingActiveMaster(startupStatus,clusterStatusTracker);
{code}

Instead we return 'cleanSetOfActiveMaster', which was previously false. Because of this, instead of becoming active again, the backup master goes down in the abort() code.
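
The effect of discarding the result of the recursive call can be reproduced outside HBase. The following is a minimal sketch under stated assumptions, not the HBase source: the class ActiveMasterRetrySketch, its two methods and the failuresLeft counter are hypothetical stand-ins for the real ZK interaction, and it assumes that the first attempt to become active fails while the retry succeeds.

```java
/** Hypothetical, simplified model of the retry logic; not the HBase source. */
class ActiveMasterRetrySketch {

    /** How many attempts fail before this master can become active (assumption). */
    private int failuresLeft;

    ActiveMasterRetrySketch(int failuresBeforeSuccess) {
        this.failuresLeft = failuresBeforeSuccess;
    }

    /** Stand-in for the attempt to register as the active master. */
    private boolean tryBecomeActive() {
        if (failuresLeft > 0) {
            failuresLeft--;
            return false;
        }
        return true;
    }

    /** Models the reported behaviour: the result of the retry is thrown away. */
    boolean blockUntilActiveDiscardingRetry() {
        if (tryBecomeActive()) {
            return true;
        }
        boolean cleanSetOfActiveMaster = false;  // first attempt failed
        blockUntilActiveDiscardingRetry();       // retry succeeds, result ignored
        return cleanSetOfActiveMaster;           // stale 'false' reaches the caller
    }

    /** One possible correction: hand the result of the retry back to the caller. */
    boolean blockUntilActivePropagatingRetry() {
        if (tryBecomeActive()) {
            return true;
        }
        return blockUntilActivePropagatingRetry();
    }

    public static void main(String[] args) {
        // One failed attempt, then success on the retry.
        System.out.println("discarding retry result:  "
            + new ActiveMasterRetrySketch(1).blockUntilActiveDiscardingRetry());
        System.out.println("propagating retry result: "
            + new ActiveMasterRetrySketch(1).blockUntilActivePropagatingRetry());
    }
}
```

With one simulated failure the first method reports false even though the retry succeeded, while the second reports true. Returning the result of the retry is only one way to address this; the committed fix may differ.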

Thanks to Gopi, my colleague, for reporting this issue.