[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257194#comment-13257194 ]

Maryann Xue commented on HBASE-5816:
------------------------------------

stack, your suggestion seems like the ultimate solution for the current HMaster 
workflow. Trunk has the same problem.
The case I attached was actually introduced by HBASE-5396, which tried to let 
ServerShutdownHandler assign the region at an earlier stage instead of waiting 
for the TimeoutMonitor to do the job. But the isRegionOnline test seems too 
weak here:
      // In ServerShutdownHandler (from HBASE-5396): reassign regions that had
      // pending region plans on the dead server, unless they are already online.
      for (HRegionInfo hri : regionsFromRegionPlansForServer) {
        if (!this.services.getAssignmentManager().isRegionOnline(hri)) {
          this.services.getAssignmentManager().assign(hri, true);
          reassignedPlans++;
        }
      }
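
Just to illustrate what a stronger guard could look like (the 
isRegionInTransition(hri) helper below is hypothetical, not the actual 0.90.6 
API), something along these lines:

      // Sketch only: skip regions that are already online OR already in
      // transition (e.g. PENDING_OPEN on another server), so that the
      // ServerShutdownHandler does not kick off a second, concurrent assign.
      AssignmentManager am = this.services.getAssignmentManager();
      for (HRegionInfo hri : regionsFromRegionPlansForServer) {
        if (!am.isRegionOnline(hri) && !am.isRegionInTransition(hri)) {
          am.assign(hri, true);
          reassignedPlans++;
        }
      }

Even this check-then-act is not atomic by itself; it would still need to run 
under the same lock that guards the assign path to be fully safe.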

However, I think any client call to HBaseAdmin.assign() that coincides with 
this point would cause the same problem. There is a lock guarding the private 
assign() method to deal with concurrent assigns, but the entire assign process 
is not atomic. It should be safe for the later thread to just return, or to get 
an exception, if the region has already been assigned by an earlier thread.
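
A minimal sketch of that idea, assuming setOfflineInZooKeeper() is changed to 
report failure instead of throwing (the RegionState accessors shown are 
illustrative, not necessarily the exact 0.90.6 signatures):

      // Sketch only: bail out instead of aborting the master when a concurrent
      // assign has already moved the region past CLOSED/OFFLINE.
      private boolean setOfflineInZooKeeper(final RegionState state) {
        if (!state.isClosed() && !state.isOffline()) {
          LOG.warn("Attempted to force OFFLINE but region is " + state
              + "; another assign is already in progress, skipping");
          return false;
        }
        // (remainder of the method: create/force the OFFLINE znode as before)
        return true;
      }

      // In assign(): a false return means another thread won the race, so the
      // later thread simply returns instead of hitting IllegalStateException.
      if (!setOfflineInZooKeeper(state)) {
        return;
      }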
                
> Balancer and ServerShutdownHandler concurrently reassigning the same region
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-5816
>                 URL: https://issues.apache.org/jira/browse/HBASE-5816
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.6
>            Reporter: Maryann Xue
>            Priority: Critical
>         Attachments: HBASE-5816.patch
>
>
> The first assign thread exits with success after updating the RegionState to 
> PENDING_OPEN, while the second assign follows immediately into "assign" and 
> fails the RegionState check in setOfflineInZooKeeper(). This causes the 
> master to abort.
> In the case below, the two concurrent assigns occurred when the AM tried to 
> assign a region to a dying/dead RS, and meanwhile the ServerShutdownHandler 
> tried to assign this region (from the region plan) on its own.
> 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance 
> hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
> src=hadoop05.sh.intel.com,60020,1334544902186, 
> dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:44:57,648 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
> (offlining)
> 2012-04-17 05:44:57,648 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
> serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, 
> regions=0, usedHeap=0, maxHeap=0) for region 
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.
> 2012-04-17 05:44:57,666 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
> node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b 
> (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
>  server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING)
> 2012-04-17 05:52:58,984 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
> was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
> state=CLOSED, ts=1334612697672, 
> server=hadoop05.sh.intel.com,60020,1334544902186
> 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:60000-0x236b912e9b3000e Creating (or updating) unassigned node for 
> fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
> 2012-04-17 05:52:59,096 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
> region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
> plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
>  src=hadoop05.sh.intel.com,60020,1334544902186, 
> dest=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:52:59,096 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
> xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:19,159 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
> was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
> state=PENDING_OPEN, ts=1334613179096, 
> server=xmlqa-clv16.sh.intel.com,60020,1334612497253
> 2012-04-17 05:54:59,033 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of 
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
> serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, 
> regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0
> java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket 
> timeout exception: java.net.SocketTimeoutException: 120000 millis timeout 
> while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 
> remote=/10.239.47.87:60020]
>         at 
> org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778)
>         at 
> org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283)
>         at $Proxy7.openRegion(Unknown Source)
>         at 
> org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1127)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:912)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:892)
>         at 
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:92)
>         at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.net.SocketTimeoutException: 120000 millis timeout while 
> waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 
> remote=/10.239.47.87:60020]
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>         at 
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:301)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>         at java.io.DataInputStream.readInt(DataInputStream.java:370)
>         at 
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:541)
>         at 
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:479)
> 2012-04-17 05:54:59,035 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan 
> was found (or we are ignoring an existing plan) for 
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. so 
> generated a random one; 
> hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., 
> src=, dest=hadoop06.sh.intel.com,60020,1334544901894; 7 (online=7, 
> exclude=serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, 
> load=(requests=0, regions=0, usedHeap=0, maxHeap=0)) available servers
> 2012-04-17 05:54:59,035 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:60000-0x236b912e9b3000e Creating (or updating) unassigned node for 
> fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state
> 2012-04-17 05:54:59,045 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for 
> region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; 
> plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.,
>  src=, dest=hadoop06.sh.intel.com,60020,1334544901894
> 2012-04-17 05:54:59,045 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to 
> hadoop06.sh.intel.com,60020,1334544901894
> 2012-04-17 05:54:59,046 FATAL org.apache.hadoop.hbase.master.HMaster: 
> Unexpected state trying to OFFLINE; 
> TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 
> state=PENDING_OPEN, ts=1334613299045, 
> server=hadoop06.sh.intel.com,60020,1334544901894
> java.lang.IllegalStateException
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1167)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1107)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:912)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:892)
>         at 
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:259)
>         at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> 2012-04-17 05:54:59,047 INFO org.apache.hadoop.hbase.master.HMaster: Aborting


        
