[ 
https://issues.apache.org/jira/browse/HBASE-3914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038411#comment-13038411
 ] 

Jieshan Bean commented on HBASE-3914:
-------------------------------------

Thanks for your quick comments on this patch, Stack.
1. "hbase.catalog.verification.timeout" is not new. I took it from the 
original code (in method HMaster#assignRootAndMeta); other places use it the 
same way, so I just followed that.
   
2. I have thought about adding this method to ServerShutdownHandler, and 
indeed it would be better as a private method.

I will make the change immediately :)
   

> ROOT region appeared in two regionserver's onlineRegions at the same time
> -------------------------------------------------------------------------
>
>                 Key: HBASE-3914
>                 URL: https://issues.apache.org/jira/browse/HBASE-3914
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3914.patch
>
>
> This can happen, with low probability, through the following steps:
> (Assume the cluster nodes are named RS1/RS2/HM, and there are more than 
> 10,000 regions in the cluster.)
> 1. The ROOT region was opened on RS1.
> 2. For some reason (perhaps the HDFS process became abnormal), RS1 aborted.
> 3. ServerShutdownHandler processing started.
> 4. HMaster was restarted. During finishInitialization handling, the ROOT 
> region was unset and assigned to RS2.
> 5. The ROOT region was opened successfully on RS2.
> 6. After a while, the ROOT region was unset again by RS1's 
> ServerShutdownHandler and then reassigned. Before that, RS1 had been 
> restarted. So there are two possibilities:
>  Case a:
>    The ROOT region was assigned to RS1.
>    Nothing appeared to be affected, but the ROOT region was still online 
> on RS2.
>    
>  Case b:
>    The ROOT region was assigned to RS2.
>    The ROOT region could not be opened until it was reassigned to another 
> regionserver, because it was already shown as online on this regionserver.
> This can be seen in the logs:
> 1. The ROOT region was opened twice:
> 2011-05-17 10:32:59,188 DEBUG 
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region 
> -ROOT-,,0.70236052 on 162-2-77-0,20020,1305598359031
> 2011-05-17 10:33:01,536 DEBUG 
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region 
> -ROOT-,,0.70236052 on 162-2-16-6,20020,1305597548212
> 2. Regionserver 162-2-16-6 aborted, so the ROOT region was reassigned to 
> 162-2-77-0, but it was already online on that server:
> 10:49:30,920 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: 
> Received request to open region: -ROOT-,,0.70236052
> 10:49:30,920 DEBUG 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing 
> open of -ROOT-,,0.70236052
> 10:49:30,920 WARN 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Attempted 
> open of -ROOT-,,0.70236052 but already online on this server
> This could leave the ROOT region offline for a long time, although it only 
> happens in a special scenario. I have checked the code, and there seems to 
> be a small bug here.
> There are 2 calls to assignRoot():
> 1.
> HMaster#assignRootAndMeta:
>     if (!catalogTracker.verifyRootRegionLocation(timeout)) {
>       this.assignmentManager.assignRoot();
>       this.catalogTracker.waitForRoot();
>       assigned++;
>     }
> 2.
> ServerShutdownHandler#process:
>     
>       if (isCarryingRoot()) { // -ROOT-      
>         try {        
>            this.services.getAssignmentManager().assignRoot();
>         } catch (KeeperException e) {
>            this.server.abort(
>                "In server shutdown processing, assigning root", e);
>            throw new IOException("Aborting", e);
>         }
>       }    
> I think that every time assignRoot() is called, we should verify the ROOT 
> region's location first, because the ROOT region may already have been 
> assigned from another code path before this assignment.
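
The suggested check can be pictured with the minimal Java sketch below. It is 
not the attached patch and does not use the real HBase classes: FakeCatalog, 
FakeAssigner and verifyAndAssignRoot are hypothetical stand-ins for 
CatalogTracker, AssignmentManager and a guarded assign. The synchronized 
block is also an assumption, added so that the check and the assignment 
cannot interleave in the sketch.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

final class RootAssignSketch {

    /** Hypothetical stand-in for CatalogTracker: records where -ROOT- is deployed. */
    public static final class FakeCatalog {
        private final AtomicReference<String> rootLocation = new AtomicReference<>(null);

        /** Returns true if -ROOT- is already deployed on some server. */
        public boolean verifyRootRegionLocation() {
            return rootLocation.get() != null;
        }

        public void setRootLocation(String server) {
            rootLocation.set(server);
        }

        public String getRootLocation() {
            return rootLocation.get();
        }
    }

    /** Hypothetical stand-in for AssignmentManager: counts assignRoot() calls. */
    public static final class FakeAssigner {
        public final AtomicInteger assignments = new AtomicInteger();
        private final FakeCatalog catalog;
        private final String target;

        public FakeAssigner(FakeCatalog catalog, String target) {
            this.catalog = catalog;
            this.target = target;
        }

        public void assignRoot() {
            assignments.incrementAndGet();
            catalog.setRootLocation(target);
        }
    }

    /**
     * Assigns -ROOT- only if it is not already deployed.
     * Returns true if an assignment was made, false if it was skipped.
     */
    public static boolean verifyAndAssignRoot(FakeCatalog catalog, FakeAssigner assigner) {
        synchronized (catalog) { // assumption: one lock guards check + assign
            if (catalog.verifyRootRegionLocation()) {
                return false; // another caller already assigned -ROOT-
            }
            assigner.assignRoot();
            return true;
        }
    }

    public static void main(String[] args) {
        FakeCatalog catalog = new FakeCatalog();
        FakeAssigner assigner = new FakeAssigner(catalog, "RS2");
        System.out.println(verifyAndAssignRoot(catalog, assigner)); // true
        System.out.println(verifyAndAssignRoot(catalog, assigner)); // false
        System.out.println(assigner.assignments.get());             // 1
    }
}
```

With a guard of this shape, a second caller (for example a late 
ServerShutdownHandler) finds ROOT already deployed and skips the assignment 
instead of unsetting and reassigning it.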

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
