[ 
https://issues.apache.org/jira/browse/HBASE-12464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219092#comment-14219092
 ] 

Stephen Yuan Jiang commented on HBASE-12464:
--------------------------------------------

The org.apache.hadoop.hbase.client.TestHCM.testClusterStatus test failure looks 
unrelated to my change.  
I can repro it without my change; I also ran the test with my changes under 
breakpoints, and none of the breakpoints was hit.  The strange part is that I 
can consistently repro it on my machine (the latest master) with or without my 
change; however, the last 3 official commits passed the test (though the last 
one, b6dd9b4, showed a strange passing message: instead of 'PASS', it said 
'Fixed').  

The javadoc warnings have nothing to do with this change and are pre-existing 
(I checked recently committed JIRAs and the warnings are there too):
{code}
[WARNING] Javadoc Warnings
[WARNING] 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-common/src/main/java/org/apache/hadoop/hbase/util/Bytes.java:54:
 warning: Unsafe is internal proprietary API and may be removed in a future 
release
[WARNING] import sun.misc.Unsafe;
[WARNING] ^

[WARNING] 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-client/src/main/java/org/apache/hadoop/hbase/client/Admin.java:603:
 warning - @param argument "regionserver" is not a parameter name.

[WARNING] 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.java:92:
 warning - @param argument "<KEY>" is not a type parameter name.
{code}

> meta table region assignment stuck in the FAILED_OPEN state due to region 
> server not fully ready to serve
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-12464
>                 URL: https://issues.apache.org/jira/browse/HBASE-12464
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 1.0.0, 2.0.0, 0.99.1
>            Reporter: Stephen Yuan Jiang
>            Assignee: Stephen Yuan Jiang
>             Fix For: 2.0.0
>
>         Attachments: HBASE-12464.v1-1.0.patch, HBASE-12464.v1-2.0.patch, 
> HBASE-12464.v2-2.0.patch
>
>   Original Estimate: 24h
>          Time Spent: 7.4h
>  Remaining Estimate: 1h
>
> Meta table region assignment could reach the 'FAILED_OPEN' state, which 
> makes the region unavailable until the target region server shuts down or 
> an operator resolves it manually.  This is an undesirable state for a meta 
> table region.
> Here is the sequence of how this could happen (the code is in 
> AssignmentManager#assign()):
> Step 1: Master detects that a region server (RS1) hosting a meta table 
> region is down, and changes the meta region state from 'online' to 'offline'.
> Step 2: In a loop (with a configurable maximumAttempts count; the default is 
> 10 and the minimum is 1), AssignmentManager tries to find a RS to host the 
> meta table region.  If no RS is available, it is meant to loop forever by 
> resetting the loop count (BUG#1 in this logic - a small bug):
> {code}
>            if (region.isMetaRegion()) {
>               try {
>                 Thread.sleep(this.sleepTimeBeforeRetryingMetaAssignment);
>                 // ==> BUG: if maximumAttempts is 1, then the loop will end.
>                 if (i == maximumAttempts) i = 1;
>                 continue;
>               } catch (InterruptedException e) {
>               ...
>            }
> {code}
> Step 3: Once a new RS is found (RS2), inside the same loop as Step 2, 
> AssignmentManager tries to assign the meta region to RS2 (OFFLINE, RS1 => 
> PENDING_OPEN, RS2).  If opening the region on RS2 fails for some reason 
> (e.g., the target RS2 is not ready to serve - ServerNotRunningYetException), 
> AssignmentManager changes the state from (PENDING_OPEN, RS2) to 
> (FAILED_OPEN, RS2).  Then it retries (and may even pick a different RS).  
> The number of retries is bounded by maximumAttempts.  Once maximumAttempts 
> is reached, the meta region stays in the 'FAILED_OPEN' state until either 
> (1) RS2 shuts down, which triggers region assignment again, or (2) it is 
> reassigned by an operator via HBase Shell.  
> According to the documentation 
> ( http://hbase.apache.org/book/regions.arch.html ), this is by design - 
> "17. For regions in FAILED_OPEN or FAILED_CLOSE states, the master tries to 
> close them again when they are reassigned by an operator via HBase Shell.".  
> However, this is a bad design, especially for the meta table region (it 
> could be argued that the design is fine for regular tables - for this 
> ticket, I am focusing on fixing the meta region availability issue).  
> I propose 2 possible fixes:
> Fix#1 (band-aid change): in Step 3, just like Step 2, if the region is a meta 
> table region, reset the loop count so that it does not leave the loop with 
> the meta table region in the FAILED_OPEN state.
> Fix#2 (more involved): if a region is in the FAILED_OPEN state, we should 
> provide a way to automatically trigger AssignmentManager#assign() after a 
> short period of time (leaving any region in FAILED_OPEN or other states like 
> 'FAILED_CLOSE' is undesirable; there should be some way to retry and 
> auto-heal the region).
> I think at least for 1.0.0, Fix#1 is good enough.  We can open a task-type 
> JIRA for Fix#2 in a future release.
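
To make BUG#1 from Step 2 of the quoted description concrete, here is a minimal, self-contained simulation of the counter reset. It is only a sketch and not the actual AssignmentManager code: the class RetryLoopSketch, the method metaAssigned, and its parameters are hypothetical names, and the loop header "for (int i = 1; i <= maximumAttempts; i++)" is an assumption about how the real loop is shaped. Resetting to 0 instead of 1 is shown as one possible correction, not necessarily what the attached patches do.

```java
class RetryLoopSketch {
    /**
     * Simulates the "no region server available" branch of the retry loop.
     * Returns true if the loop is still running when a server shows up on
     * pass number serverAppearsOnPass (1-based), false if the loop exits first.
     * All names here are hypothetical; this only models the counter reset.
     */
    static boolean metaAssigned(int maximumAttempts, int resetTo, int serverAppearsOnPass) {
        int pass = 0;
        for (int i = 1; i <= maximumAttempts; i++) {
            pass++;
            if (pass >= serverAppearsOnPass) {
                return true; // a server is available; assignment can proceed
            }
            // No server yet: for meta, reset the counter instead of giving up.
            // The loop's own i++ still runs after this reset.
            if (i == maximumAttempts) {
                i = resetTo;
            }
        }
        return false; // loop ended with meta still unassigned
    }

    public static void main(String[] args) {
        System.out.println(metaAssigned(10, 1, 50)); // true: reset to 1 keeps looping
        System.out.println(metaAssigned(1, 1, 50));  // false: i becomes 2 > 1, loop ends
        System.out.println(metaAssigned(1, 0, 50));  // true: reset to 0 survives the i++
    }
}
```

With maximumAttempts set to 1, resetting the counter to 1 is immediately followed by the loop increment, so the condition 1 <= maximumAttempts fails and the loop exits. That is the behavior the BUG comment in the quoted snippet points at.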

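The idea behind Fix#2 in the quoted description (automatically re-triggering assignment for a region left in FAILED_OPEN) could be sketched roughly as follows. This is an illustration under stated assumptions, not HBase code: FailedOpenRetrySketch, its State enum, the tryOpen callback standing in for the open request to a region server, and the fixed retry delay are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

class FailedOpenRetrySketch {
    enum State { OFFLINE, PENDING_OPEN, OPEN, FAILED_OPEN }

    // Daemon thread so a region that never opens does not keep the JVM alive.
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "failed-open-retry");
            t.setDaemon(true);
            return t;
        });
    private final BooleanSupplier tryOpen; // hypothetical stand-in for the open request
    private final long retryDelayMs;
    private final CountDownLatch opened = new CountDownLatch(1);
    private volatile State state = State.OFFLINE;

    FailedOpenRetrySketch(BooleanSupplier tryOpen, long retryDelayMs) {
        this.tryOpen = tryOpen;
        this.retryDelayMs = retryDelayMs;
    }

    /** One assignment attempt; on failure, schedules itself again after a delay. */
    void assign() {
        state = State.PENDING_OPEN;
        if (tryOpen.getAsBoolean()) {
            state = State.OPEN;
            opened.countDown();
            scheduler.shutdown();
        } else {
            state = State.FAILED_OPEN;
            // Instead of leaving the region in FAILED_OPEN, retry later.
            scheduler.schedule(this::assign, retryDelayMs, TimeUnit.MILLISECONDS);
        }
    }

    /** Waits up to timeoutMs for the region to open; false on timeout or interrupt. */
    boolean awaitOpen(long timeoutMs) {
        try {
            return opened.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    State state() {
        return state;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // The simulated server is "not running yet" for the first two attempts.
        FailedOpenRetrySketch sketch =
            new FailedOpenRetrySketch(() -> calls.incrementAndGet() >= 3, 10);
        sketch.assign();
        boolean ok = sketch.awaitOpen(5000);
        System.out.println(ok + " " + sketch.state() + " after " + calls.get() + " attempts");
    }
}
```

A real implementation would also need to bound or back off the retries and coordinate with server-shutdown handling; the sketch leaves those out to keep the retry idea visible.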


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
