[ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ramkrishna.s.vasudevan updated HBASE-4064: ------------------------------------------ Fix Version/s: (was: 0.90.6) 0.90.7 > Two concurrent unassigning of the same region caused the endless loop of > "Region has been PENDING_CLOSE for too long..." > ------------------------------------------------------------------------------------------------------------------------ > > Key: HBASE-4064 > URL: https://issues.apache.org/jira/browse/HBASE-4064 > Project: HBase > Issue Type: Bug > Components: master > Affects Versions: 0.90.3 > Reporter: Jieshan Bean > Fix For: 0.90.7 > > Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, > disableflow.png > > > 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in > regionsInTransition(The RegionState was remained by some exception which > should be removed, that's why I called it as "rubbish" object), but the > region is not currently assigned anywhere, TimeoutMonitor will fall into an > endless loop: > 2011-06-27 10:32:21,326 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed > out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. > state=PENDING_CLOSE, ts=1309141555301 > 2011-06-27 10:32:21,326 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Region has been > PENDING_CLOSE for too long, running forced unassign again on > region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. > 2011-06-27 10:32:21,438 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of > region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. > (offlining) > 2011-06-27 10:32:21,441 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign > region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is > not currently assigned anywhere > 2011-06-27 10:32:31,207 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed > out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. > state=PENDING_CLOSE, ts=1309141555301 > 2011-06-27 10:32:31,207 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Region has been > PENDING_CLOSE for too long, running forced unassign again on > region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. > 2011-06-27 10:32:31,215 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of > region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. > (offlining) > 2011-06-27 10:32:31,215 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign > region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is > not currently assigned anywhere > 2011-06-27 10:32:41,164 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed > out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. > state=PENDING_CLOSE, ts=1309141555301 > 2011-06-27 10:32:41,164 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Region has been > PENDING_CLOSE for too long, running forced unassign again on > region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. > 2011-06-27 10:32:41,172 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of > region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. > (offlining) > 2011-06-27 10:32:41,172 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign > region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is > not currently assigned anywhere > ..... > 2 In the following scenario, two concurrent unassigning call of the same > region may lead to the above problem: > the first unassign call send rpc call success, the master watched the event > of "RS_ZK_REGION_CLOSED", process this event, will create a > ClosedRegionHandler to remove the state of the region in master.eg. > while ClosedRegionHandler is running in > "hbase.master.executor.closeregion.threads" thread (A), another unassign call > of same region run in another thread(B). > while thread B run "if (!regions.containsKey(region))", this.regions have > the region info, now cpu switch to thread A. > The thread A will remove the region from the sets of "this.regions" and > "regionsInTransition", then switch to thread B. the thread B run continue, > will throw an exception with the msg of "Server null returned > java.lang.NullPointerException: Passed server is null for > 9a6e26d40293663a79523c58315b930f", but without removing the new-adding > RegionState from "regionsInTransition",and it can not be removed for ever. > public void unassign(HRegionInfo region, boolean force) { > LOG.debug("Starting unassignment of region " + > region.getRegionNameAsString() + " (offlining)"); > synchronized (this.regions) { > // Check if this region is currently assigned > if (!regions.containsKey(region)) { > LOG.debug("Attempted to unassign region " + > region.getRegionNameAsString() + " but it is not " + > "currently assigned anywhere"); > return; > } > } > String encodedName = region.getEncodedName(); > // Grab the state of this region and synchronize on it > RegionState state; > long stamp = -1; > synchronized (regionsInTransition) { > state = regionsInTransition.get(encodedName); > if (state == null) { > state = new RegionState(region, RegionState.State.PENDING_CLOSE); > stamp =state.getStamp(); > regionsInTransition.put(encodedName, state); > } else if (force && state.isPendingClose()) { > LOG.debug("Attempting to unassign region " + > region.getRegionNameAsString() + " which is already pending close > " > + "but forcing an additional close"); > state.update(RegionState.State.PENDING_CLOSE); > } else { > LOG.debug("Attempting to unassign region " + > region.getRegionNameAsString() + " but it is " + > "already in transition (" + state.getState() + ")"); > return; > } > } > > // Send CLOSE RPC > HServerInfo server = null; > synchronized (this.regions) { > server = regions.get(region); > } > } > 2011-06-27 10:20:59,583 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of > region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. > (offlining) > 2011-06-27 10:20:59,585 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to > serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, > usedHeap=0, maxHeap=0) for region > test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. > 2011-06-27 10:22:15,299 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned > node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f > (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., > server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) > 2011-06-27 10:22:15,299 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Handling > transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, > region=9a6e26d40293663a79523c58315b930f > 2011-06-27 10:25:22,636 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed > out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. > state=CLOSED, ts=1309141335247 > 2011-06-27 10:25:22,636 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Region > 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on > queued ClosedRegionHandler to run or server shutdown > 2011-06-27 10:25:55,022 DEBUG > org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED > event for 9a6e26d40293663a79523c58315b930f > 2011-06-27 10:25:55,022 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so > deleting ZK node and removing from regions in transition, skipping assignment > of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. > 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:20000-0x230ba8b85230036 Deleting existing unassigned node for > 9a6e26d40293663a79523c58315b930f that is in expected state > RS_ZK_REGION_CLOSED > 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:20000-0x230ba8b85230036 Successfully deleted unassigned node for > region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED > 2011-06-27 10:25:55,301 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of > region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. > (offlining) > 2011-06-27 10:25:55,302 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Server null returned > java.lang.NullPointerException: Passed server is null for > 9a6e26d40293663a79523c58315b930f > 2011-06-27 10:32:21,326 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed > out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. > state=PENDING_CLOSE, ts=1309141555301 > 2011-06-27 10:32:21,326 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Region has been > PENDING_CLOSE for too long, running forced unassign again on > region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. > 2011-06-27 10:32:21,438 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of > region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. > (offlining) > 2011-06-27 10:32:21,441 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign > region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is > not currently assigned anywhere > 3 The following scenario shows how the above problem 2 happened: > (1)A table have a lot of regions, more than 70000 in my test. > (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, > 'DisableTableHandler.process' will create another BulkDisabler object and > start its thread pool. The region which was still online will call > AssignmentManager.unassign again. so the same region > "AssignmentManager.unassign" could be called concurrentlly more than 1. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira