[ https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038956#comment-13038956 ]
stack commented on HBASE-3914:
------------------------------

bq. In the future, I will keep in mind what you reminded me of, and I also hope I can contribute more patches.

No problem and keep the patches coming!

> ROOT region appeared in two regionserver's onlineRegions at the same time
> -------------------------------------------------------------------------
>
>                 Key: HBASE-3914
>                 URL: https://issues.apache.org/jira/browse/HBASE-3914
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3914-V2.patch, HBASE-3914.patch
>
>
> This can happen, with low probability, through the following steps
> (assume the cluster nodes are named RS1/RS2/HM, and there are more than
> 10,000 regions in the cluster):
> 1. The ROOT region was opened on RS1.
> 2. For some reason (maybe the HDFS process became abnormal), RS1 aborted.
> 3. ServerShutdownHandler processing started.
> 4. HMaster was restarted. During finishInitialization's handling, the ROOT
> region was unset and assigned to RS2.
> 5. The ROOT region was opened successfully on RS2.
> 6. After a while, the ROOT region was unset again by RS1's
> ServerShutdownHandler and then reassigned. RS1 had been restarted before
> that, so there are two possibilities:
> Case a:
>   The ROOT region was assigned to RS1.
>   Nothing appeared to be affected, but the ROOT region was still online
> on RS2.
>
> Case b:
>   The ROOT region was assigned to RS2.
>   The ROOT region could not be opened until it was reassigned to another
> regionserver, because it was already shown as online on this regionserver.
>
> This can be proved from the logs:
> 1. The ROOT region was opened twice:
> 2011-05-17 10:32:59,188 DEBUG
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region
> -ROOT-,,0.70236052 on 162-2-77-0,20020,1305598359031
> 2011-05-17 10:33:01,536 DEBUG
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region
> -ROOT-,,0.70236052 on 162-2-16-6,20020,1305597548212
> 2. Regionserver 162-2-16-6 aborted, so the ROOT region was reassigned to
> 162-2-77-0, where it was already online:
> 10:49:30,920 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
> Received request to open region: -ROOT-,,0.70236052
> 10:49:30,920 DEBUG
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing
> open of -ROOT-,,0.70236052
> 10:49:30,920 WARN
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Attempted
> open of -ROOT-,,0.70236052 but already online on this server
>
> This can leave the ROOT region offline for a long time, although it only
> happens in a special scenario. I have checked the code, and there seems to
> be a small bug here.
> There are two callers of assignRoot():
> 1.
> HMaster#assignRootAndMeta:
>     if (!catalogTracker.verifyRootRegionLocation(timeout)) {
>       this.assignmentManager.assignRoot();
>       this.catalogTracker.waitForRoot();
>       assigned++;
>     }
> 2.
> ServerShutdownHandler#process:
>
>     if (isCarryingRoot()) { // -ROOT-
>       try {
>         this.services.getAssignmentManager().assignRoot();
>       } catch (KeeperException e) {
>         this.server.abort("In server shutdown processing, assigning root", e);
>         throw new IOException("Aborting", e);
>       }
>     }
>
> I think we should verify the ROOT region's location before each call to
> assignRoot(), because the ROOT region could already have been assigned
> from another code path by then.
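
For illustration only, the guard proposed at the end of the description can be sketched with a small stand-alone model. This is not the attached patch and not HBase code: RootAssignmentSketch, serverUp, serverDown, verifyRootLocation and assignRootIfNeeded are hypothetical names. verifyRootLocation only imitates the role that catalogTracker.verifyRootRegionLocation(timeout) plays in the quoted HMaster snippet, under the assumption that a location is valid when the recorded server is still alive.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the master's view of -ROOT-; none of these names are HBase API.
final class RootAssignmentSketch {
    private final Set<String> liveServers = new HashSet<>();
    private String rootLocation; // server recorded as hosting -ROOT-, or null if unassigned
    private int assignCount;     // number of times -ROOT- was (re)assigned

    synchronized void serverUp(String server) { liveServers.add(server); }
    synchronized void serverDown(String server) { liveServers.remove(server); }

    // Stand-in for a location check: true only if -ROOT- is recorded
    // on a server that is still alive (an assumption of this sketch).
    synchronized boolean verifyRootLocation() {
        return rootLocation != null && liveServers.contains(rootLocation);
    }

    // Unconditional assignment, like a bare assignRoot() call.
    synchronized void assignRoot(String target) {
        rootLocation = target;
        assignCount++;
    }

    // Guarded assignment: skip when a valid location already exists.
    synchronized boolean assignRootIfNeeded(String target) {
        if (verifyRootLocation()) {
            return false; // already assigned from another code path
        }
        assignRoot(target);
        return true;
    }

    synchronized String rootLocation() { return rootLocation; }
    synchronized int assignCount() { return assignCount; }

    public static void main(String[] args) {
        RootAssignmentSketch m = new RootAssignmentSketch();
        m.serverUp("RS1");
        m.serverUp("RS2");
        m.assignRoot("RS1");                             // step 1: ROOT opened on RS1
        m.serverDown("RS1");                             // step 2: RS1 aborted
        boolean byMaster = m.assignRootIfNeeded("RS2");  // step 4: restarted master assigns ROOT
        m.serverUp("RS1");                               // RS1 restarted
        boolean byHandler = m.assignRootIfNeeded("RS1"); // step 6: late shutdown handler
        // prints: true false RS2
        System.out.println(byMaster + " " + byHandler + " " + m.rootLocation());
    }
}
```

In this model the late ServerShutdownHandler call becomes a no-op because the ROOT region already has a live location on RS2, which is the behaviour the description asks for. Whether a liveness check alone is sufficient in the real master is left open here.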