Author: liyin
Date: Sat Dec 28 19:18:17 2013
New Revision: 1553891

URL: http://svn.apache.org/r1553891
Log: [0.89-fb] [master] Ensure that rootRS death is processed correctly upon master failover.
Author: aaiyer

Summary:
We have seen cases where kill-hbase can get the system into a state where the
master is unable to assign any of the regions, because the master never assigns
the root region. This happens because the old root region server is being
processed as dead, and as part of that processing the master performs a root
scan, which fails because the region server is dead. ProcessServerShutdown
checks whether the dead server is the root server and tries to get root
assigned when ProcessServerShutdown is instantiated. However, in the case of a
master failover, where the state is reconstructed from ZKClusterStateRecovery,
the rootRegionServer reference can be updated after the ProcessServerShutdown
is created. In that case, PSS never makes any progress past the split-log
stage.

Test Plan: Reproduced the scenario on my dev cluster.

Reviewers: liyintang, rshroff
Reviewed By: liyintang
CC: mbm, hbase-eng@

Differential Revision: https://phabricator.fb.com/D1091877

Modified:
    hbase/branches/0.89-fb/src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java

Modified: hbase/branches/0.89-fb/src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java
URL: http://svn.apache.org/viewvc/hbase/branches/0.89-fb/src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java?rev=1553891&r1=1553890&r2=1553891&view=diff
==============================================================================
--- hbase/branches/0.89-fb/src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java (original)
+++ hbase/branches/0.89-fb/src/main/java/org/apache/hadoop/hbase/master/ProcessServerShutdown.java Sat Dec 28 19:18:17 2013
@@ -49,6 +49,7 @@ class ProcessServerShutdown extends Regi
   // Server name made of the concatenation of hostname, port and startcode
   // formatted as <code><hostname> ',' <port> ',' <startcode></code>
   private final String deadServer;
+  private long deadServerStartCode;
   private boolean isRootServer;
   private List<MetaRegion> metaRegions, metaRegionsUnassigned;
   private boolean rootRescanned;
@@ -83,6 +84,7 @@ class ProcessServerShutdown extends Regi
     super(master, serverInfo.getServerName());
     this.deadServer = serverInfo.getServerName();
     this.deadServerAddress = serverInfo.getServerAddress();
+    this.deadServerStartCode = serverInfo.getStartCode();
     this.rootRescanned = false;
     this.successfulMetaScans = new HashSet<String>();
     // check to see if I am responsible for either ROOT or any of the META tables.
@@ -315,6 +317,22 @@ class ProcessServerShutdown extends Regi
     if (LOG.isDebugEnabled()) {
       HServerAddress addr = master.getRegionManager().getRootRegionLocation();
       if (addr != null) {
+        if (addr.equals(deadServerAddress)) {
+          // This should not happen unless the master has restarted recently, because we
+          // explicitly call unsetRootRegion() in closeMetaRegions, which is called when
+          // ProcessServerShutdown is instantiated.
+          // However, in the case of a recovery by ZKClusterStateRecovery, it is possible that
+          // the rootRegion was updated after closeMetaRegions() was called. If we let the rootRegion
+          // point to a dead server, the cluster might just block, because all ScanRootRegion calls
+          // will continue to fail. Let us fix this by ensuring that the root gets reassigned.
+          if (deadServerStartCode == master.getRegionManager().getRootServerInfo().getStartCode()) {
+            LOG.error(ProcessServerShutdown.this.toString() + " unsetting root because it is on the dead server being processed");
+            master.getRegionManager().reassignRootRegion();
+            return false;
+          } else {
+            LOG.info(ProcessServerShutdown.this.toString() + " NOT unsetting root because it is on the dead server, but different start code");
+          }
+        }
         LOG.debug(ProcessServerShutdown.this.toString() + " scanning root region on " + addr.getBindAddress());
       } else {
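The essence of the fix is that matching the dead server's address alone is not enough: the server may have restarted on the same host and port, in which case the root region is now hosted by a new incarnation and must not be unset. The start code distinguishes incarnations. The following is a minimal, self-contained sketch of that guard; the class and field names (`ServerInfo`, `address`, `startCode`, `shouldReassignRoot`) are hypothetical stand-ins for the HServerInfo/HServerAddress types used in the actual patch.

```java
// Sketch of the root-reassignment guard from the patch above.
// Hypothetical simplified types; not the real HBase classes.
public class RootGuardSketch {

    static final class ServerInfo {
        final String address;   // stands in for HServerAddress
        final long startCode;   // distinguishes incarnations on the same address
        ServerInfo(String address, long startCode) {
            this.address = address;
            this.startCode = startCode;
        }
    }

    /**
     * Reassign root only when the server currently recorded as hosting root
     * is the *same incarnation* as the dead server being processed:
     * same address AND same start code.
     */
    static boolean shouldReassignRoot(ServerInfo deadServer, ServerInfo currentRoot) {
        return currentRoot != null
            && currentRoot.address.equals(deadServer.address)
            && currentRoot.startCode == deadServer.startCode;
    }

    public static void main(String[] args) {
        ServerInfo dead = new ServerInfo("rs1.example.com:60020", 100L);

        // Root still points at the dead incarnation -> reassign.
        System.out.println(shouldReassignRoot(dead,
                new ServerInfo("rs1.example.com:60020", 100L)));  // true

        // Same address but a newer start code: the server restarted and
        // legitimately hosts root again -> do NOT unset it.
        System.out.println(shouldReassignRoot(dead,
                new ServerInfo("rs1.example.com:60020", 200L)));  // false
    }
}
```

Without the start-code comparison, a ProcessServerShutdown created before a ZKClusterStateRecovery update could unset a perfectly healthy root assignment; with it, only the stale incarnation triggers reassignment.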
