[ https://issues.apache.org/jira/browse/HBASE-12464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stephen Yuan Jiang updated HBASE-12464:
---------------------------------------
    Attachment: HBASE-12464.v2-2.0.patch

The HBASE-12464.v2-2.0 patch is based on the feedback from [~jxiang] and [~enis]. The target is 2.0:
- Add log warning entries when retries reach maximumAttempts
- Handle AssignmentManager#onRegionFailedOpen when maximumAttempts is reached

> meta table region assignment stuck in the FAILED_OPEN state due to region
> server not fully ready to serve
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-12464
>                 URL: https://issues.apache.org/jira/browse/HBASE-12464
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 1.0.0, 0.99.1
>            Reporter: Stephen Yuan Jiang
>            Assignee: Stephen Yuan Jiang
>             Fix For: 1.0.0, 0.99.2
>
>         Attachments: HBASE-12464.v1-1.0.patch, HBASE-12464.v1-2.0.patch,
>                      HBASE-12464.v2-2.0.patch
>
>   Original Estimate: 24h
>          Time Spent: 7.4h
>  Remaining Estimate: 1h
>
> A meta table region assignment can end up in the 'FAILED_OPEN' state, which
> leaves the region unavailable until the target region server shuts down or
> an operator resolves it manually. This is an undesirable state for a meta
> table region.
> Here is the sequence that leads to it (the code is in
> AssignmentManager#assign()):
> Step 1: The master detects that a region server (RS1) hosting a meta table
> region is down, and changes the meta region state from 'online' to 'offline'.
> Step 2: In a loop (with a configurable maximumAttempts count; the default is
> 10 and the minimum is 1), AssignmentManager tries to find a RS to host the
> meta table region. If no RS is available, it is meant to loop forever by
> resetting the loop count (BUG#1 from this logic - a small bug):
> {code}
>           if (region.isMetaRegion()) {
>             try {
>               Thread.sleep(this.sleepTimeBeforeRetryingMetaAssignment);
>               if (i == maximumAttempts) i = 1; // ==> BUG: if maximumAttempts is 1, then the loop will end.
>               continue;
>             } catch (InterruptedException e) {
>               ...
>             }
>           }
> {code}
> Step 3: Once a new RS is found (RS2), inside the same loop as Step 2,
> AssignmentManager tries to assign the meta region to RS2 (OFFLINE, RS1 =>
> PENDING_OPEN, RS2). If opening the region on RS2 fails for some reason
> (e.g. the target RS2 is not ready to serve - ServerNotRunningYetException),
> AssignmentManager changes the state from (PENDING_OPEN, RS2) to
> (FAILED_OPEN, RS2). It then retries (and may even pick a different RS).
> The retry count is capped at maximumAttempts. Once maximumAttempts is
> reached, the meta region stays in the 'FAILED_OPEN' state until either
> (1) RS2 shuts down, which triggers region assignment again, or (2) it is
> reassigned by an operator via HBase Shell.
> According to the documentation ( http://hbase.apache.org/book/regions.arch.html ),
> this is by design - "17. For regions in FAILED_OPEN or FAILED_CLOSE states,
> the master tries to close them again when they are reassigned by an operator
> via HBase Shell.".
> However, this is a bad design, especially for a meta table region (arguably
> the design is acceptable for regular tables - for this ticket, I am focused
> on fixing the meta region availability issue).
> I propose 2 possible fixes:
> Fix#1 (band-aid change): in Step 3, just like Step 2, if the region is a meta
> table region, reset the loop count so that the loop never exits with the
> meta table region in FAILED_OPEN state.
> Fix#2 (more involved): if a region is in FAILED_OPEN state, we should provide
> a way to automatically trigger AssignmentManager#assign() after a short
> period of time (leaving any region in FAILED_OPEN state or other states like
> 'FAILED_CLOSE' is undesirable; there should be some way to retry and
> auto-heal the region).
> I think at least for 1.0.0, Fix#1 is good enough. We can open a task-type
> JIRA for Fix#2 in a future release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
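
For illustration only, here is a minimal, self-contained sketch of the Fix#1 idea: reset the loop counter for a meta region so the retry loop does not exit in FAILED_OPEN. This is not the attached patch and not the real AssignmentManager code. The class name, the `assign` signature, the `tryOpen` callback, and the `hardLimit` cap (added only so the sketch terminates) are all hypothetical. The counter is reset to 0 rather than 1, so the reset also works when maximumAttempts is 1.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch of Fix#1; none of these names come from HBase.
class MetaAssignRetrySketch {
    enum State { OPEN, FAILED_OPEN }

    /**
     * Simulates the retry loop described in Steps 2 and 3.
     *
     * @param isMetaRegion    whether the region being assigned is a meta region
     * @param maximumAttempts per-round retry limit (minimum 1)
     * @param hardLimit       assumption: overall cap so this sketch always terminates
     * @param tryOpen         stand-in for the open RPC; receives the 1-based
     *                        overall attempt number, returns true once the open succeeds
     */
    static State assign(boolean isMetaRegion, int maximumAttempts,
                        int hardLimit, IntPredicate tryOpen) {
        int total = 0;
        for (int i = 1; i <= maximumAttempts && total < hardLimit; i++) {
            total++;
            if (tryOpen.test(total)) {
                return State.OPEN;
            }
            // Fix#1: a meta region never runs out of attempts. Resetting to 0
            // (the loop increment brings it back to 1) avoids the off-by-one
            // that ends the loop when maximumAttempts == 1.
            if (isMetaRegion && i == maximumAttempts) {
                i = 0;
            }
        }
        // User regions (or a meta region that hit the sketch-only hardLimit)
        // end up here, as in the behavior described above.
        return State.FAILED_OPEN;
    }
}
```

With a target server that only becomes ready on the fifth attempt, `assign(false, 3, 100, n -> n >= 5)` gives up in FAILED_OPEN, while `assign(true, 3, 100, n -> n >= 5)` and `assign(true, 1, 100, n -> n >= 5)` keep retrying until the region opens. A real change would also sleep between rounds, as the Step 2 code does.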