[ https://issues.apache.org/jira/browse/HBASE-9387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13754245#comment-13754245 ]

Jeffrey Zhong commented on HBASE-9387:
--------------------------------------

After checking the logs, the root cause is that the META region got lost: the 
Master thinks it is assigned while the region server that was hosting META has 
closed it. In the end, no region server is hosting META.

As you will see in the following logs, the RS successfully opened the META 
region but hit an issue updating the ZK data to OPENED, so it retried. One of 
the retries actually went through, and the Master then deleted the RIT znode. 
The RS's next retry failed to set the ZK state to OPENED because the RIT znode 
had already been deleted by the Master, so the RS closed the META region. Later 
the RS also failed to set the RIT state to FAILED_OPEN to notify the Master, 
again because the RIT znode was gone.
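
For illustration only, here is a minimal sketch of that ambiguity against the 
raw ZooKeeper client (this is not the actual ZKAssign code; the class and 
method names are placeholders):

{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

final class RetryRaceSketch {
  /** Returns true only if the OPENED write was acknowledged to this client. */
  static boolean transitionToOpened(ZooKeeper zk, String ritPath,
      byte[] openedData, int expectedVersion) throws InterruptedException {
    while (true) {
      try {
        zk.setData(ritPath, openedData, expectedVersion);
        return true; // write acknowledged
      } catch (KeeperException.ConnectionLossException e) {
        // Ambiguous outcome: the write may or may not have been applied on
        // the server. The loop retries blindly, as RecoverableZooKeeper does.
      } catch (KeeperException.NoNodeException e) {
        // If the first attempt DID succeed, the Master may have consumed the
        // OPENED event and deleted the znode, so the retry lands here. The RS
        // cannot tell this apart from a real clash and closes the region.
        return false;
      } catch (KeeperException e) {
        return false; // any other ZK failure: treat as a failed transition
      }
    }
  }
}
{code}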

Since the issue could happen for any region assignment, I think we can fix it 
as follows:

Check META to see if the current RS owns the region. If the current RS owns 
it, just proceed with the post-open process; otherwise, keep the logic we have 
today. If we can't get this region's info from META, we should abort the 
current RS.
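
A minimal sketch of the proposed decision (hypothetical names throughout: 
lookupServerFromMeta() stands in for a META read and Decision for the handler 
outcome; this is not existing HBase code):

{code}
final class PostOpenCheckSketch {

  enum Decision { PROCEED_POST_OPEN, FALLBACK_TO_CURRENT_LOGIC, ABORT_RS }

  static Decision decide(String regionEncodedName, String thisServerName) {
    final String ownerInMeta;
    try {
      ownerInMeta = lookupServerFromMeta(regionEncodedName); // placeholder META read
    } catch (Exception e) {
      // Can't read META: ownership is unknown, so abort this RS rather than
      // risk a double assignment.
      return Decision.ABORT_RS;
    }
    if (thisServerName.equals(ownerInMeta)) {
      // META says this RS owns the region: keep it open and run the
      // post-open steps even though the RIT znode is gone.
      return Decision.PROCEED_POST_OPEN;
    }
    // Someone else owns it (or it is unassigned): do what the code does
    // today, i.e. close the region and report FAILED_OPEN.
    return Decision.FALLBACK_TO_CURRENT_LOGIC;
  }

  private static String lookupServerFromMeta(String regionEncodedName) {
    throw new UnsupportedOperationException("placeholder for a META lookup");
  }
}
{code}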


Below are related logs:

{code}
2013-08-29 22:15:29,978 DEBUG [RS_OPEN_META-kiyo:57016-0] 
zookeeper.ZKAssign(786): regionserver:57016-0x140cc24e86a0003 Transitioning 
1588230740 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
...
2013-08-29 22:15:32,746 WARN  [RS_OPEN_META-kiyo:57016-0] 
zookeeper.RecoverableZooKeeper(258): Possibly transient ZooKeeper, 
quorum=localhost:61251, 
exception=org.apache.zookeeper.KeeperException$ConnectionLossException: 
KeeperErrorCode = ConnectionLoss for /hbase/region-in-transition/1588230740
2013-08-29 22:15:32,746 INFO  [RS_OPEN_META-kiyo:57016-0] 
util.RetryCounter(54): Sleeping 2000ms before retry #1...
...
2013-08-29 22:15:34,148 DEBUG [AM.ZK.Worker-pool2-t4] 
master.AssignmentManager(801): Handling transition=RS_ZK_REGION_OPENED, 
server=kiyo.gq1.ygridcore.net,57016,1377814510039, region=1588230740, current 
state from region state map ={1588230740 state=OPENING, ts=1377814529578, 
server=kiyo.gq1.ygridcore.net,57016,1377814510039}
...
2013-08-29 22:15:34,149 DEBUG [AM.ZK.Worker-pool2-t4] zookeeper.ZKAssign(405): 
master:44745-0x140cc24e86a0001-0x140cc24e86a0001 Deleting existing unassigned 
node 1588230740 in expected state RS_ZK_REGION_OPENED
2013-08-29 22:15:34,179 DEBUG [pool-1-thread-1-EventThread] 
zookeeper.ZooKeeperWatcher(312): 
master:44745-0x140cc24e86a0001-0x140cc24e86a0001 Received ZooKeeper Event, 
type=NodeDeleted, state=SyncConnected, 
path=/hbase/region-in-transition/1588230740
...
2013-08-29 22:15:34,760 WARN  [RS_OPEN_META-kiyo:57016-0] 
zookeeper.ZKAssign(866): regionserver:57016-0x140cc24e86a0003-0x140cc24e86a0003 
Attempt to transition the unassigned node for 1588230740 from 
RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED failed, the node existed and was in 
the expected state but then when setting data it no longer existed
2013-08-29 22:15:34,761 WARN  [RS_OPEN_META-kiyo:57016-0] 
handler.OpenRegionHandler(370): Completed the OPEN of region 
hbase:meta,,1.1588230740 but when transitioning from  OPENING to OPENED got a 
version mismatch, someone else clashed so now unassigning -- closing region on 
server: kiyo.gq1.ygridcore.net,57016,1377814510039
2013-08-29 22:15:34,761 DEBUG [RS_OPEN_META-kiyo:57016-0] 
regionserver.HRegion(950): Closing hbase:meta,,1.1588230740: disabling 
compactions & flushes
2013-08-29 22:15:34,761 DEBUG [RS_OPEN_META-kiyo:57016-0] 
regionserver.HRegion(972): Updates disabled for region hbase:meta,,1.1588230740
2013-08-29 22:15:34,762 INFO  [StoreCloserThread-hbase:meta,,1.1588230740-1] 
regionserver.HStore(668): Closed info
2013-08-29 22:15:34,762 INFO  [RS_OPEN_META-kiyo:57016-0] 
regionserver.HRegion(1030): Closed hbase:meta,,1.1588230740
2013-08-29 22:15:34,763 INFO  [RS_OPEN_META-kiyo:57016-0] 
handler.OpenRegionHandler(396): Opening of region {ENCODED => 1588230740, NAME 
=> 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} failed, transitioning from 
OPENING to FAILED_OPEN in ZK, expecting version 1
2013-08-29 22:15:34,763 DEBUG [RS_OPEN_META-kiyo:57016-0] 
zookeeper.ZKAssign(786): regionserver:57016-0x140cc24e86a0003-0x140cc24e86a0003 
Transitioning 1588230740 from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN
2013-08-29 22:15:34,765 DEBUG [RS_OPEN_META-kiyo:57016-0] 
zookeeper.ZKUtil(784): regionserver:57016-0x140cc24e86a0003-0x140cc24e86a0003 
Unable to get data of znode /hbase/region-in-transition/1588230740 because node 
does not exist (not necessarily an error)
2013-08-29 22:15:34,765 WARN  [RS_OPEN_META-kiyo:57016-0] 
handler.OpenRegionHandler(404): Unable to mark region {ENCODED => 1588230740, 
NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} as FAILED_OPEN. It's 
likely that the master already timed out this open attempt, and thus another RS 
already has the region.
{code}

                
> Region could get lost during assignment
> ---------------------------------------
>
>                 Key: HBASE-9387
>                 URL: https://issues.apache.org/jira/browse/HBASE-9387
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 0.95.2
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>         Attachments: 9387-v1.txt, 
> org.apache.hadoop.hbase.TestFullLogReconstruction-output.txt
>
>
> I observed a test timeout running against hadoop 2.1.0 with distributed log 
> replay turned on.
> Looks like region state for 1588230740 became inconsistent between master and 
> the surviving region server:
> {code}
> 2013-08-29 22:15:34,180 INFO  [AM.ZK.Worker-pool2-t4] 
> master.RegionStates(299): Onlined 1588230740 on 
> kiyo.gq1.ygridcore.net,57016,1377814510039
> ...
> 2013-08-29 22:15:34,587 DEBUG [Thread-221] 
> client.HConnectionManager$HConnectionImplementation(1269): locateRegionInMeta 
> parentTable=hbase:meta, metaLocation={region=hbase:meta,,1.1588230740, 
> hostname=kiyo.gq1.ygridcore.net,57016,1377814510039, seqNum=0}, attempt=2 of 
> 35 failed; retrying after sleep of 302 because: 
> org.apache.hadoop.hbase.exceptions.RegionOpeningException: Region is being 
> opened: 1588230740
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2574)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3949)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2733)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:26965)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2063)
>         at 
> org.apache.hadoop.hbase.ipc.RpcServer$CallRunner.run(RpcServer.java:1800)
>         at 
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:165)
>         at 
> org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:41)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
