[ https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754245#comment-13754245 ]
Jeffrey Zhong commented on HBASE-9387:
--------------------------------------

After checking the logs, the root cause is that the META region gets lost: the Master thinks it is assigned, while the region server that was hosting META has closed it, so in the end no region server is hosting META. As the logs below show, the RS successfully opened the META region but hit a ZooKeeper connection issue while updating the ZK node to OPENED, so it retried. One of the retries actually went through, and the Master, seeing the OPENED state, deleted the RIT znode. The RS, unaware that its retry had succeeded, failed its next attempt to set the ZK state to OPENED because the RIT znode had already been deleted by the Master, and so it closed the META region. It then also failed to set the RIT state to FAILED_OPEN to notify the Master, again because the RIT znode was gone. Since this can happen for any region assignment, I think we can fix the issue as follows: check META to see whether the current RS owns the region. If the current RS owns it, just proceed with the post-open process; otherwise, follow today's logic. If we can't get info on the region from META, we should abort the current RS.

Below are the related logs:
{code}
2013-08-29 22:15:29,978 DEBUG [RS_OPEN_META-kiyo:57016-0] zookeeper.ZKAssign(786): regionserver:57016-0x140cc24e86a0003 Transitioning 1588230740 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
...
2013-08-29 22:15:32,746 WARN [RS_OPEN_META-kiyo:57016-0] zookeeper.RecoverableZooKeeper(258): Possibly transient ZooKeeper, quorum=localhost:61251, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/region-in-transition/1588230740
2013-08-29 22:15:32,746 INFO [RS_OPEN_META-kiyo:57016-0] util.RetryCounter(54): Sleeping 2000ms before retry #1...
...
2013-08-29 22:15:34,148 DEBUG [AM.ZK.Worker-pool2-t4] master.AssignmentManager(801): Handling transition=RS_ZK_REGION_OPENED, server=kiyo.gq1.ygridcore.net,57016,1377814510039, region=1588230740, current state from region state map ={1588230740 state=OPENING, ts=1377814529578, server=kiyo.gq1.ygridcore.net,57016,1377814510039}
...
2013-08-29 22:15:34,149 DEBUG [AM.ZK.Worker-pool2-t4] zookeeper.ZKAssign(405): master:44745-0x140cc24e86a0001-0x140cc24e86a0001 Deleting existing unassigned node 1588230740 in expected state RS_ZK_REGION_OPENED
2013-08-29 22:15:34,179 DEBUG [pool-1-thread-1-EventThread] zookeeper.ZooKeeperWatcher(312): master:44745-0x140cc24e86a0001-0x140cc24e86a0001 Received ZooKeeper Event, type=NodeDeleted, state=SyncConnected, path=/hbase/region-in-transition/1588230740
...
2013-08-29 22:15:34,760 WARN [RS_OPEN_META-kiyo:57016-0] zookeeper.ZKAssign(866): regionserver:57016-0x140cc24e86a0003-0x140cc24e86a0003 Attempt to transition the unassigned node for 1588230740 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED failed, the node existed and was in the expected state but then when setting data it no longer existed
2013-08-29 22:15:34,761 WARN [RS_OPEN_META-kiyo:57016-0] handler.OpenRegionHandler(370): Completed the OPEN of region hbase:meta,,1.1588230740 but when transitioning from OPENING to OPENED got a version mismatch, someone else clashed so now unassigning -- closing region on server: kiyo.gq1.ygridcore.net,57016,1377814510039
2013-08-29 22:15:34,761 DEBUG [RS_OPEN_META-kiyo:57016-0] regionserver.HRegion(950): Closing hbase:meta,,1.1588230740: disabling compactions & flushes
2013-08-29 22:15:34,761 DEBUG [RS_OPEN_META-kiyo:57016-0] regionserver.HRegion(972): Updates disabled for region hbase:meta,,1.1588230740
2013-08-29 22:15:34,762 INFO [StoreCloserThread-hbase:meta,,1.1588230740-1] regionserver.HStore(668): Closed info
2013-08-29 22:15:34,762 INFO [RS_OPEN_META-kiyo:57016-0] regionserver.HRegion(1030): Closed hbase:meta,,1.1588230740
2013-08-29 22:15:34,763 INFO [RS_OPEN_META-kiyo:57016-0] handler.OpenRegionHandler(396): Opening of region {ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} failed, transitioning from OPENING to FAILED_OPEN in ZK, expecting version 1
2013-08-29 22:15:34,763 DEBUG [RS_OPEN_META-kiyo:57016-0] zookeeper.ZKAssign(786): regionserver:57016-0x140cc24e86a0003-0x140cc24e86a0003 Transitioning 1588230740 from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN
2013-08-29 22:15:34,765 DEBUG [RS_OPEN_META-kiyo:57016-0] zookeeper.ZKUtil(784): regionserver:57016-0x140cc24e86a0003-0x140cc24e86a0003 Unable to get data of znode /hbase/region-in-transition/1588230740 because node does not exist (not necessarily an error)
2013-08-29 22:15:34,765 WARN [RS_OPEN_META-kiyo:57016-0] handler.OpenRegionHandler(404): Unable to mark region {ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} as FAILED_OPEN. It's likely that the master already timed out this open attempt, and thus another RS already has the region.
{code}

> Region could get lost during assignment
> ---------------------------------------
>
>                 Key: HBASE-9387
>                 URL: https://issues.apache.org/jira/browse/HBASE-9387
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 0.95.2
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>         Attachments: 9387-v1.txt, org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt
>
>
> I observed test timeout running against hadoop 2.1.0 with distributed log replay turned on.
> Looks like region state for 1588230740 became inconsistent between master and the surviving region server:
> {code}
> 2013-08-29 22:15:34,180 INFO [AM.ZK.Worker-pool2-t4] master.RegionStates(299): Onlined 1588230740 on kiyo.gq1.ygridcore.net,57016,1377814510039
> ...
> 2013-08-29 22:15:34,587 DEBUG [Thread-221] client.HConnectionManager$HConnectionImplementation(1269): locateRegionInMeta parentTable=hbase:meta, metaLocation={region=hbase:meta,,1.1588230740, hostname=kiyo.gq1.ygridcore.net,57016,1377814510039, seqNum=0}, attempt=2 of 35 failed; retrying after sleep of 302 because: org.apache.hadoop.hbase.exceptions.RegionOpeningException: Region is being opened: 1588230740
>   at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2574)
>   at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3949)
>   at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2733)
>   at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:26965)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2063)
>   at org.apache.hadoop.hbase.ipc.RpcServer$CallRunner.run(RpcServer.java:1800)
>   at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:165)
>   at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:41)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
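The fix proposed in the comment above (check META for region ownership in the post-open path) could be sketched roughly as below. This is a minimal, hypothetical illustration of the three-way decision only; the class, enum, and method names (`PostOpenCheck`, `PostOpenAction`, `decidePostOpen`) are invented for this sketch and are not real HBase APIs, and the actual patch would read the server location from META rather than take it as a string parameter.

```java
// Hypothetical sketch of the proposed post-open decision logic.
// None of these names are real HBase APIs; this only models the three
// outcomes described in the comment above.
public class PostOpenCheck {

    enum PostOpenAction {
        PROCEED_POST_OPEN,   // META says this RS owns the region
        CLOSE_REGION,        // META says another server owns it: today's logic
        ABORT_REGION_SERVER  // META unreadable: safest to abort this RS
    }

    /**
     * @param ownerInMeta server name recorded in META for the region,
     *                    or null if META could not be read
     * @param thisServer  name of the RS running the open handler
     */
    static PostOpenAction decidePostOpen(String ownerInMeta, String thisServer) {
        if (ownerInMeta == null) {
            // Can't get info from META on this region: abort the current RS
            // rather than risk leaving META hosted nowhere.
            return PostOpenAction.ABORT_REGION_SERVER;
        }
        if (ownerInMeta.equals(thisServer)) {
            // META says the current RS owns the region: proceed with the
            // post-open process even though the RIT znode transition failed.
            return PostOpenAction.PROCEED_POST_OPEN;
        }
        // Another server owns it: keep today's behavior and close the region.
        return PostOpenAction.CLOSE_REGION;
    }

    public static void main(String[] args) {
        String rs = "kiyo.gq1.ygridcore.net,57016,1377814510039";
        System.out.println(decidePostOpen(rs, rs));             // PROCEED_POST_OPEN
        System.out.println(decidePostOpen("other-rs,1,1", rs)); // CLOSE_REGION
        System.out.println(decidePostOpen(null, rs));           // ABORT_REGION_SERVER
    }
}
```

The key point is the null branch: in the failure above, the RS closed META and then could not notify the Master, so nobody hosted META; aborting the RS instead would at least trigger server-shutdown handling and reassignment.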