[ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069442#comment-13069442 ]

gaojinchao commented on HBASE-4064:
-----------------------------------

@J-D Thanks for your reply. :)

I got it. In my case, the race is between the disable thread and the 
ClosedRegionHandler threads:

 1. The disable thread gets the table's regions from the regions collection 
(see getRegionsOfTable).

 2. The thread pool takes a region and sends a close request to the region 
server; at the same time it puts the region into RIT (regionsInTransition) 
to mark the region as being processed.

 3. The region server finishes closing the region, changes the ZK state, and 
notifies the master.

 4. When the master receives the watcher event, it removes the region from 
RIT and then removes it from the regions collection. This leaves a short 
window: if the disable can't finish within one period, the region may be 
unassigned again.

My patch tries to fix the above case: remove the region from the regions 
collection first, so the disable thread can't pick up a region that is 
still being processed.
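
For illustration, here is a minimal sketch of that reordering (the field and
method names are assumptions made for the sketch, not the actual
AssignmentManager code or the patch itself):

  // Sketch only. Assumed fields:
  //   regions             : Map<HRegionInfo, HServerInfo>  (online regions)
  //   regionsInTransition : Map<String, RegionState>       (regions in RIT)
  void handleRegionClosed(HRegionInfo hri) {
    // Remove from the online map FIRST, so a disable retry that scans
    // getRegionsOfTable() can no longer pick up this half-closed region.
    synchronized (this.regions) {
      this.regions.remove(hri);
    }
    // Only then clear the in-transition marker. With the old order (RIT
    // cleared first, regions second) there was a window in which the region
    // was still in regions but no longer in RIT, so the disable thread
    // could queue a second unassign for it.
    synchronized (this.regionsInTransition) {
      this.regionsInTransition.remove(hri.getEncodedName());
    }
  }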


I found another issue yesterday: the enable path has a race condition as 
well. (I changed the period to 1 minute in order to reproduce it.) It seems 
the thread pool couldn't finish before a new enable round started; we need 
a sleep when an enable period finishes.
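
Something along these lines (a sketch only; the constructor and helper
shapes here are assumed for illustration, not the real EnableTableHandler
API) would give the previous round's pool time to drain before a new round
starts:

  // Sketch: pause between enable rounds when waitUntilDone() times out.
  void enableWithPause(long perRoundTimeoutMs, long pauseMs)
      throws InterruptedException {
    boolean done = false;
    while (!done) {
      BulkEnabler bulk = newBulkEnabler();           // hypothetical helper
      bulk.bulkAssign();                             // start the worker pool
      done = bulk.waitUntilDone(perRoundTimeoutMs);  // false on timeout
      if (!done) {
        // Let in-flight assign() calls from this round settle before a new
        // round touches the same regions.
        Thread.sleep(pauseMs);
      }
    }
  }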

The master logs:
2011-07-22 13:33:27,806 INFO 
org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 
regions of which 2156 are online.

2011-07-22 13:34:28,646 INFO 
org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 
regions of which 982 are online.
2011-07-22 13:34:31,079 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,080 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:60000-0x31502ef4f00000 Creating (or updating) unassigned node for 
c9b1c97ac6c00033ceb1890e45e66229 with OFFLINE state
2011-07-22 13:34:31,104 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Forcing OFFLINE; 
was=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. 
state=OFFLINE, ts=1311312871080
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
No previous transition plan was found (or we are ignoring an existing plan) for 
ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. so generated a 
random one; 
hri=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229., src=, 
dest=C4C2.site,60020,1311310281335; 3 (online=3, exclude=null) available servers
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Assigning region 
ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. to 
C4C2.site,60020,1311310281335
2011-07-22 13:34:31,122 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,123 FATAL org.apache.hadoop.hbase.master.HMaster: 
Unexpected state trying to OFFLINE; 
ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. 
state=PENDING_OPEN, ts=1311312871121
java.lang.IllegalStateException
        at 
org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1081)
        at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1036)
        at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:864)
        at 
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:844)
        at 
org.apache.hadoop.hbase.master.handler.EnableTableHandler$BulkEnabler$1.run(EnableTableHandler.java:154)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2011-07-22 13:34:31,125 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
2011-07-22 13:34:31,482 DEBUG 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: 
master:60000-0x31502ef4f00000 Received ZooKeeper Event, type=NodeDataChanged, 
state=SyncConnected, path=/hbase/unassigned/c9b1c97ac6c00033ceb1890e45e66229
2011-07-22 13:34:31,482 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: 
master:60000-0x31502ef4f00000 Unable to get data of znode 
/hbase/unassigned/c9b1c97ac6c00033ceb1890e45e66229
 


> Two concurrent unassigning of the same region caused the endless loop of 
> "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
>
>
> 1. If there is a stale ("rubbish") RegionState object with "PENDING_CLOSE" 
> in regionsInTransition (a RegionState left behind by some exception and 
> never removed, which is why I call it a "rubbish" object), but the region 
> is not currently assigned anywhere, TimeoutMonitor will fall into an 
> endless loop:
> 2011-06-27 10:32:21,326 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> (offlining)
> 2011-06-27 10:32:21,441 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
> not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> (offlining)
> 2011-06-27 10:32:31,215 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
> not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> (offlining)
> 2011-06-27 10:32:41,172 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
> not currently assigned anywhere
> .....
> 2. In the following scenario, two concurrent unassign calls for the same 
> region can lead to the above problem:
> The first unassign call sends its RPC successfully; the master watches the 
> "RS_ZK_REGION_CLOSED" event and, when processing it, creates a 
> ClosedRegionHandler to remove the region's state in the master. E.g.
> while the ClosedRegionHandler is running in an 
> "hbase.master.executor.closeregion.threads" thread (A), another unassign 
> call for the same region runs in another thread (B).
> When thread B evaluates "if (!regions.containsKey(region))", this.regions 
> still contains the region info; the CPU then switches to thread A.
> Thread A removes the region from both "this.regions" and 
> "regionsInTransition", then execution switches back to thread B. Thread B 
> continues and throws an exception with the message "Server null returned 
> java.lang.NullPointerException: Passed server is null for 
> 9a6e26d40293663a79523c58315b930f", but without removing the newly added 
> RegionState from "regionsInTransition", so that RegionState can never be 
> removed.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     // Race window: a ClosedRegionHandler running in another thread can
>     // remove the region from this.regions and regionsInTransition right
>     // here, after the containsKey check above has already passed.
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp = state.getStamp();
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() +
>             " which is already pending close " +
>             "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       // If thread A already removed the region, server stays null and the
>       // PENDING_CLOSE RegionState added above is never cleaned up.
>       server = regions.get(region);
>     }
>     // ... (remainder of the method omitted)
> }
> 2011-06-27 10:20:59,583 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> (offlining) 
> 2011-06-27 10:20:59,585 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
> serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, 
> usedHeap=0, maxHeap=0) for region 
> test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
> node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f 
> (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., 
> server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, 
> region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region 
> 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on 
> queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG 
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
> event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so 
> deleting ZK node and removing from regions in transition, skipping assignment 
> of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:20000-0x230ba8b85230036 Deleting existing unassigned node for 
> 9a6e26d40293663a79523c58315b930f that is in expected state 
> RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:20000-0x230ba8b85230036 Successfully deleted unassigned node for 
> region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> (offlining) 
> 2011-06-27 10:25:55,302 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Server null returned 
> java.lang.NullPointerException: Passed server is null for 
> 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> (offlining) 
> 2011-06-27 10:32:21,441 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
> region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is 
> not currently assigned anywhere 
> 3. The following scenario shows how the above problem 2 happened:
> (1) A table has a lot of regions, more than 70000 in my test.
> (2) Disable the table. If BulkDisabler.waitUntilDone times out, 
> DisableTableHandler.process will create another BulkDisabler object and 
> start its thread pool. Each region that is still online will have 
> AssignmentManager.unassign called on it again, so 
> AssignmentManager.unassign can be called concurrently more than once for 
> the same region.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira