[ https://issues.apache.org/jira/browse/HBASE-12398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192803#comment-14192803 ]
Andrew Purtell commented on HBASE-12398: ---------------------------------------- +1 for 0.98 > Region isn't assigned in an extreme race condition > -------------------------------------------------- > > Key: HBASE-12398 > URL: https://issues.apache.org/jira/browse/HBASE-12398 > Project: HBase > Issue Type: Bug > Components: Region Assignment > Affects Versions: 0.98.7 > Reporter: Jeffrey Zhong > Assignee: Jeffrey Zhong > Attachments: HBASE-12398.patch > > > In a test, [~enis] has seen a condition which made one of the regions > unassigned. > The client failed since the region is not online anywhere: > {code} > 2014-10-29 01:51:40,731 WARN [HBaseReaderThread_13] > util.MultiThreadedReader: > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after > attempts=35, exceptions: > Wed Oct 29 01:39:51 UTC 2014, > org.apache.hadoop.hbase.client.RpcRetryingCaller@cc21330, > org.apache.hadoop.hbase.NotServingRegionException: > org.apache.hadoop.hbase.NotServingRegionException: Region > IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9. > is not online on hor8n08.gq1.ygridcore.net,60020,1414546670414 > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2774) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4257) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2906) > at > org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29990) > at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2078) > at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) > at > org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) > at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) > at java.lang.Thread.run(Thread.java:722) > {code} > The root cause of the issue is due to some extreme race condition: > a) a region is about to open and receives a closeRpc request triggered by a > second re-assignment > b) the second re-assignment updates region state to offline while immediately > is overwritten to OPEN from previous region open ZK opened notification > c) when the region reopened on the same RS by the second assignment, AM force > the region to close as the its region state isn't in PendingOpenOrOpening > state. > d) the region ends up offline & can't server any request > Region Server Side: > 1) A region almost opens region 689b77e1bad7e951b0d9ef4663b217e9 while the > RS(hor8n10) receives a closeRegion request. > {noformat} > 2014-10-29 01:39:43,153 INFO > [PriorityRpcServer.handler=2,queue=0,port=60020] regionserver.HRegionServer: > Received CLOSE for the region:689b77e1bad7e951b0d9ef4663b217e9 , which we are > already trying to OPEN. Cancelling OPENING. > {noformat} > 2) Since region 689b77e1bad7e951b0d9ef4663b217e9 was already opened right > before some final steps, so the RS logs the following message and close > 689b77e1bad7e951b0d9ef4663b217e9 immediately after the RS update ZK node > state to 'OPENED'. > {noformat} > 2014-10-29 01:39:43,198 ERROR [RS_OPEN_REGION-hor8n10:60020-0] > handler.OpenRegionHandler: Race condition: we've finished to open a region, > while a close was requested on > region=IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9.. > It can be a critical error, as a region that should be closed is now opened. > Closing it now > {noformat} > In Master Server Side: > {noformat} > 2014-10-29 01:39:43,177 DEBUG [AM.ZK.Worker-pool2-t55] > master.AssignmentManager: Handling RS_ZK_REGION_OPENED, > server=hor8n10.gq1.ygridcore.net,60020,1414546531945, > region=689b77e1bad7e951b0d9ef4663b217e9, > current_state={689b77e1bad7e951b0d9ef4663b217e9 state=OPENING, > ts=1414546783152, server=hor8n10.gq1.ygridcore.net,60020,1414546531945} > .... > 2014-10-29 01:39:43,255 DEBUG [AM.-pool1-t16] master.AssignmentManager: > Offline > IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9., > it's not any more on hor8n10.gq1.ygridcore.net,60020,1414546531945 > .... > 2014-10-29 01:39:43,942 DEBUG [AM.ZK.Worker-pool2-t58] > master.AssignmentManager: Handling RS_ZK_REGION_OPENED, > server=hor8n10.gq1.ygridcore.net,60020,1414546531945, > region=689b77e1bad7e951b0d9ef4663b217e9, > current_state={689b77e1bad7e951b0d9ef4663b217e9 state=OPEN, ts=1414546783387, > server=hor8n10.gq1.ygridcore.net,60020,1414546531945} > 2014-10-29 01:39:43,942 WARN [AM.ZK.Worker-pool2-t58] > master.AssignmentManager: Received OPENED for > 689b77e1bad7e951b0d9ef4663b217e9 from > hor8n10.gq1.ygridcore.net,60020,1414546531945 but the region isn't > PENDING_OPEN/OPENING here: {689b77e1bad7e951b0d9ef4663b217e9 state=OPEN, > ts=1414546783387, server=hor8n10.gq1.ygridcore.net,60020,1414546531945} > 2014-10-29 01:39:43,948 DEBUG [AM.ZK.Worker-pool2-t58] > master.AssignmentManager: Sent CLOSE to > hor8n10.gq1.ygridcore.net,60020,1414546531945 for region > IntegrationTestRegionReplicaReplication,06666666,1414545619766_0001.689b77e1bad7e951b0d9ef4663b217e9. > {noformat} > -- This message was sent by Atlassian JIRA (v6.3.4#6332)