[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257255#comment-13257255 ]
stack commented on HBASE-5816: ------------------------------ Should we have the servershutdownhandler and the balancer feed a single queue that assignment manager pulls from? If the region is already in the queue then we'd favor the purposed assignment (the balancers?) rather than the random one? > Balancer and ServerShutdownHandler concurrently reassigning the same region > --------------------------------------------------------------------------- > > Key: HBASE-5816 > URL: https://issues.apache.org/jira/browse/HBASE-5816 > Project: HBase > Issue Type: Bug > Components: master > Affects Versions: 0.90.6 > Reporter: Maryann Xue > Assignee: ramkrishna.s.vasudevan > Priority: Critical > Attachments: HBASE-5816.patch > > > The first assign thread exits with success after updating the RegionState to > PENDING_OPEN, while the second assign follows immediately into "assign" and > fails the RegionState check in setOfflineInZooKeeper(). This causes the > master to abort. > In the below case, the two concurrent assigns occurred when AM tried to > assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler > tried to assign this region (from the region plan) spontaneously. > 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance > hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., > src=hadoop05.sh.intel.com,60020,1334544902186, > dest=xmlqa-clv16.sh.intel.com,60020,1334612497253 > 2012-04-17 05:44:57,648 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of > region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. > (offlining) > 2012-04-17 05:44:57,648 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to > serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, > regions=0, usedHeap=0, maxHeap=0) for region > TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. > 2012-04-17 05:44:57,666 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned > node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b > (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., > server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING) > 2012-04-17 05:52:58,984 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; > was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. > state=CLOSED, ts=1334612697672, > server=hadoop05.sh.intel.com,60020,1334544902186 > 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:60000-0x236b912e9b3000e Creating (or updating) unassigned node for > fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state > 2012-04-17 05:52:59,096 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for > region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; > plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., > src=hadoop05.sh.intel.com,60020,1334544902186, > dest=xmlqa-clv16.sh.intel.com,60020,1334612497253 > 2012-04-17 05:52:59,096 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Assigning region > TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to > xmlqa-clv16.sh.intel.com,60020,1334612497253 > 2012-04-17 05:54:19,159 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; > was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. > state=PENDING_OPEN, ts=1334613179096, > server=xmlqa-clv16.sh.intel.com,60020,1334612497253 > 2012-04-17 05:54:59,033 WARN > org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of > TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to > serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, > regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0 > java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket > timeout exception: java.net.SocketTimeoutException: 120000 millis timeout > while waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 > remote=/10.239.47.87:60020] > at > org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805) > at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778) > at > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283) > at $Proxy7.openRegion(Unknown Source) > at > org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1127) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:912) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:892) > at > org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:92) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > Caused by: java.net.SocketTimeoutException: 120000 millis timeout while > waiting for channel to be ready for read. ch : > java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 > remote=/10.239.47.87:60020] > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) > at java.io.FilterInputStream.read(FilterInputStream.java:116) > at > org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:301) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) > at java.io.BufferedInputStream.read(BufferedInputStream.java:237) > at java.io.DataInputStream.readInt(DataInputStream.java:370) > at > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:541) > at > org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:479) > 2012-04-17 05:54:59,035 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan > was found (or we are ignoring an existing plan) for > TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. so > generated a random one; > hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., > src=, dest=hadoop06.sh.intel.com,60020,1334544901894; 7 (online=7, > exclude=serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, > load=(requests=0, regions=0, usedHeap=0, maxHeap=0)) available servers > 2012-04-17 05:54:59,035 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:60000-0x236b912e9b3000e Creating (or updating) unassigned node for > fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state > 2012-04-17 05:54:59,045 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for > region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; > plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., > src=, dest=hadoop06.sh.intel.com,60020,1334544901894 > 2012-04-17 05:54:59,045 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Assigning region > TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to > hadoop06.sh.intel.com,60020,1334544901894 > 2012-04-17 05:54:59,046 FATAL org.apache.hadoop.hbase.master.HMaster: > Unexpected state trying to OFFLINE; > TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. > state=PENDING_OPEN, ts=1334613299045, > server=hadoop06.sh.intel.com,60020,1334544901894 > java.lang.IllegalStateException > at > org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1167) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1107) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:912) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:892) > at > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:259) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:162) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > 2012-04-17 05:54:59,047 INFO org.apache.hadoop.hbase.master.HMaster: Aborting -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira