[
https://issues.apache.org/jira/browse/HBASE-12464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219092#comment-14219092
]
Stephen Yuan Jiang commented on HBASE-12464:
--------------------------------------------
The org.apache.hadoop.hbase.client.TestHCM.testClusterStatus test failure looks
unrelated to my change.
I can repro it without my change; I also ran the test with my changes under
breakpoints, and none of the breakpoints was hit. The strange part is that I can
consistently repro it on my machine (the latest master) with or without my change;
however, the last 3 official commits passed the test (though the last one,
b6dd9b4, showed a strange passing message: instead of 'PASS', it said 'Fixed').
The javadoc warnings have nothing to do with this change and are pre-existing
(I checked the recently committed JIRAs and they are there):
{code}
[WARNING] Javadoc Warnings
[WARNING]
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-common/src/main/java/org/apache/hadoop/hbase/util/Bytes.java:54:
warning: Unsafe is internal proprietary API and may be removed in a future
release
[WARNING] import sun.misc.Unsafe;
[WARNING] ^
[WARNING]
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-client/src/main/java/org/apache/hadoop/hbase/client/Admin.java:603:
warning - @param argument "regionserver" is not a parameter name.
[WARNING]
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/hbase-server/src/main/java/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.java:92:
warning - @param argument "<KEY>" is not a type parameter name.
{code}
> meta table region assignment stuck in the FAILED_OPEN state due to region
> server not fully ready to serve
> ---------------------------------------------------------------------------------------------------------
>
> Key: HBASE-12464
> URL: https://issues.apache.org/jira/browse/HBASE-12464
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 1.0.0, 2.0.0, 0.99.1
> Reporter: Stephen Yuan Jiang
> Assignee: Stephen Yuan Jiang
> Fix For: 2.0.0
>
> Attachments: HBASE-12464.v1-1.0.patch, HBASE-12464.v1-2.0.patch,
> HBASE-12464.v2-2.0.patch
>
> Original Estimate: 24h
> Time Spent: 7.4h
> Remaining Estimate: 1h
>
> Meta table region assignment can end up in the 'FAILED_OPEN' state, which
> leaves the region unavailable until the target region server shuts down or
> someone resolves it manually. This is an undesirable state for a meta table
> region.
> Here is the sequence of how this can happen (the code is in
> AssignmentManager#assign()):
> Step 1: The master detects that a region server (RS1) hosting a meta table
> region is down and changes the meta region state from 'online' to 'offline'.
> Step 2: In a loop (with a configurable maximumAttempts count; the default is 10
> and the minimum is 1), AssignmentManager tries to find an RS to host the meta
> table region. If no RS is available, it loops forever by resetting the
> loop count (BUG#1 in this logic, a small bug):
> {code}
> if (region.isMetaRegion()) {
> try {
> Thread.sleep(this.sleepTimeBeforeRetryingMetaAssignment);
> if (i == maximumAttempts) i = 1; // ==> BUG: if maximumAttempts is 1,
> // then the loop will end.
> continue;
> } catch (InterruptedException e) {
> ...
> }
> {code}
> Step 3: Once a new RS is found (RS2), inside the same loop as Step 2,
> AssignmentManager tries to assign the meta region to RS2 (OFFLINE, RS1 =>
> PENDING_OPEN, RS2). If opening the region on RS2 fails for some reason
> (e.g. the target RS2 is not ready to serve: ServerNotRunningYetException),
> AssignmentManager changes the state from (PENDING_OPEN, RS2) to
> (FAILED_OPEN, RS2). It then retries (and may even pick a different RS as the
> target), up to maximumAttempts times. Once maximumAttempts is
> reached, the meta region stays in the 'FAILED_OPEN' state until either
> (1) RS2 shuts down, which triggers region assignment again, or (2) it is
> reassigned by an operator via HBase Shell.
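To make BUG#1 from Step 2 concrete, here is a minimal standalone sketch of the counter reset. It is not the AssignmentManager code: the class name, the `givesUp` helper, the `resetTo` parameter and the safety cap are all hypothetical, and the `for (int i = 1; i <= maximumAttempts; i++)` loop shape is an assumption inferred from the snippet and the bug note in Step 2. Resetting to 0 is shown only as one possible remedy.

```java
class MetaRetryLoopSketch {
    /**
     * Simulates the Step 2 loop when no region server is ever available.
     * Returns true if the loop gives up on its own, false if it is still
     * retrying when the (hypothetical) safety cap is reached.
     */
    static boolean givesUp(int maximumAttempts, int resetTo, int safetyCap) {
        int iterations = 0;
        // Assumed loop shape; the real loop lives in AssignmentManager#assign().
        for (int i = 1; i <= maximumAttempts; i++) {
            if (++iterations > safetyCap) {
                return false; // still looping: the reset kept the retry alive
            }
            // No RS available: mimic the counter reset from the quoted snippet.
            if (i == maximumAttempts) {
                i = resetTo;
            }
            continue; // the loop's i++ runs next
        }
        return true; // loop ended with the meta region still unassigned
    }
}
```

With `resetTo = 1`, as in the quoted snippet, a `maximumAttempts` of 1 leaves the loop after a single pass, because the `i++` that follows the reset pushes `i` to 2. Larger values keep looping as intended.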
> Based on the documentation ( http://hbase.apache.org/book/regions.arch.html ),
> this is by design - "17. For regions in FAILED_OPEN or FAILED_CLOSE states,
> the master tries to close them again when they are reassigned by an operator
> via HBase Shell.".
> However, this is a bad design, especially for the meta table region (one could
> argue that the design is fine for regular tables; in this ticket, I am focusing
> on fixing the meta region availability issue).
> I propose 2 possible fixes:
> Fix#1 (band-aid change): in Step 3, just like Step 2, if the region is a meta
> table region, reset the loop count so that it would not leave the loop with
> meta table region in FAILED_OPEN state.
> Fix#2 (more involved): if a region is in the FAILED_OPEN state, we should
> provide a way to automatically trigger AssignmentManager::assign() after a
> short period of time (leaving any region in the FAILED_OPEN state, or in other
> states like 'FAILED_CLOSE', is undesirable; there should be some way to retry
> and auto-heal the region).
> I think at least for 1.0.0, Fix#1 is good enough. We can open a task-type
> JIRA for Fix#2 in a future release.
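For illustration, the Fix#1 idea (reset the loop count for a meta region when an open attempt fails) could look roughly like the sketch below. This is not the attached patch: the class, the simplified `State` enum, the `tryOpen` supplier and the `failsFirst` helper are hypothetical, the loop shape is assumed, and the sleep between retries is left out.

```java
import java.util.function.BooleanSupplier;

class FixOneSketch {
    enum State { OFFLINE, PENDING_OPEN, FAILED_OPEN, OPEN } // simplified

    /** Simulated assign loop; tryOpen stands in for asking an RS to open the region. */
    static State assign(boolean isMetaRegion, int maximumAttempts, BooleanSupplier tryOpen) {
        State state = State.OFFLINE;
        for (int i = 1; i <= maximumAttempts; i++) {
            state = State.PENDING_OPEN;
            if (tryOpen.getAsBoolean()) {
                return State.OPEN;
            }
            state = State.FAILED_OPEN;
            // Fix#1 idea: never leave the loop with a meta region in FAILED_OPEN.
            // Resetting to 0 lets the loop's i++ bring the counter back to 1.
            if (isMetaRegion && i == maximumAttempts) {
                i = 0;
            }
        }
        return state;
    }

    /** Hypothetical opener that fails the first `failures` calls, then succeeds. */
    static BooleanSupplier failsFirst(int failures) {
        int[] remaining = {failures};
        return () -> remaining[0]-- <= 0;
    }
}
```

In this sketch a regular region still gives up after `maximumAttempts` failures, while a meta region keeps retrying until an open succeeds, which also means it loops indefinitely if no open ever succeeds.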