[ https://issues.apache.org/jira/browse/HBASE-17275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857060#comment-15857060 ]
Hudson commented on HBASE-17275: -------------------------------- SUCCESS: Integrated in Jenkins build HBase-1.1-JDK8 #1923 (See [https://builds.apache.org/job/HBase-1.1-JDK8/1923/]) HBASE-17275 Assign timeout may cause region to be unassigned forever (tedyu: rev a8158b550053aa72815577f6e77786ed590c4817) * (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java > Assign timeout may cause region to be unassigned forever > -------------------------------------------------------- > > Key: HBASE-17275 > URL: https://issues.apache.org/jira/browse/HBASE-17275 > Project: HBase > Issue Type: Bug > Components: Region Assignment > Affects Versions: 1.2.3, 1.1.7 > Reporter: Allan Yang > Assignee: Allan Yang > Fix For: 1.4.0, 1.3.1, 1.2.5, 1.1.9 > > Attachments: HBASE-17275-branch-1.patch, > HBASE-17275-branch-1.v2.patch, HBASE-17275-branch-1.v3.patch > > > This is a real cased happened in my test cluster. > I have more 8000 regions to assign when I restart a cluster, but I only > started one regionserver. That means master need to assign these 8000 regions > to a single server(I know it is not right, but just for testing). > The rs recevied the open region rpc and began to open regions. But the due to > the hugh number of regions, , master timeout the rpc call(but actually some > region had already opened) after 1 mins, as you can see from log 1. > {noformat} > 1. 2016-11-22 10:17:32,285 INFO [example.org:30001.activeMasterManager] > master.AssignmentManager: Unable to communicate with > example.org,30003,1479780976834 in order to assign regions, > java.io.IOException: Call to /example.org:30003 failed on local exception: > org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1, waitTime=60001, > operationTimeout=60000 expired. > at > org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1338) > at > org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1272) > at > org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216) > at > org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:290) > at > org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.openRegion(AdminProtos.java:30177) > at > org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:1000) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1719) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2828) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2775) > at > org.apache.hadoop.hbase.master.AssignmentManager.assignAllUserRegions(AssignmentManager.java:2876) > at > org.apache.hadoop.hbase.master.AssignmentManager.processDeadServersAndRegionsInTransition(AssignmentManager.java:646) > at > org.apache.hadoop.hbase.master.AssignmentManager.joinCluster(AssignmentManager.java:493) > at > org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:796) > at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:188) > at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1711) > at java.lang.Thread.run(Thread.java:756) > Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1, > waitTime=60001, operationTimeout=60000 expired. > at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:81) > at > org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1246) > ... 14 more > {noformat} > for the region 7e9aee32eb98a6fc9d503b99fc5f9615(like many others), after > timeout, master use a pool to re-assign them, as in 2 > {noformat} > 2. 2016-11-22 10:17:32,303 DEBUG [AM.-pool1-t26] master.AssignmentManager: > Force region state offline {7e9aee32eb98a6fc9d503b99fc5f9615 > state=PENDING_OPEN, ts=1479780992078, server=example.org,30003,1479780976834} > > {noformat} > But, this region was actually opened on the rs, but (maybe) due to the hugh > pressure, the OPENED zk event recevied by master , as you can tell from 3, > "which is more than 15 seconds late" > {noformat} > 3. 2016-11-22 10:17:32,304 DEBUG [AM.ZK.Worker-pool2-t3] > master.AssignmentManager: Handling RS_ZK_REGION_OPENED, > server=example.org,30003,1479780976834, > region=7e9aee32eb98a6fc9d503b99fc5f9615, which is more than 15 seconds late, > current_state={7e9aee32eb98a6fc9d503b99fc5f9615 state=PENDING_OPEN, > ts=1479780992078, server=example.org,30003,1479780976834} > {noformat} > In the meantime, master still try to re-assign this region in another thread. > Master first close this region in case of multi assign, then change the state > of this region change from PENDING_OPEN >OFFLINE>PENDING_OPEN. Its RIT node > in zk was also transitioned to OFFLINE, as in 4,5,6,7 > {noformat} > 4. 2016-11-22 10:17:32,321 DEBUG [AM.-pool1-t26] master.AssignmentManager: > Sent CLOSE to example.org,30003,1479780976834 for region > test,P7HQ55,1475985973151.7e9aee32eb98a6fc9d503b99fc5f9615. > 5. 2016-11-22 10:17:32,461 INFO [AM.-pool1-t26] master.RegionStates: > Transition {7e9aee32eb98a6fc9d503b99fc5f9615 state=PENDING_OPEN, > ts=1479781052344, server=example.org,30003,1479780976834} to > {7e9aee32eb98a6fc9d503b99fc5f9615 state=OFFLINE, ts=1479781052461, > server=example.org,30003,1479780976834} > 6. 2016-11-22 10:17:32,469 DEBUG [AM.-pool1-t26] zookeeper.ZKAssign: > master:30001-0x15810b5f633015e, > quorum=hbase4dev04.et2sqa:2181,hbase4dev05.et2sqa:2181,hbase4dev06.et2sqa:2181, > baseZNode=/test-hbase11-func2 Creating (or updating) unassigned node > 7e9aee32eb98a6fc9d503b99fc5f9615 with OFFLINE state > 7. 2016-11-22 10:17:32,546 INFO [AM.-pool1-t26] master.AssignmentManager: > Assigning test,P7HQ55,1475985973151.7e9aee32eb98a6fc9d503b99fc5f9615. to > example.org,30003,1479780976834 > {noformat} > The thread handling the OPENED zk event still operating on this region too. > IT change the state the region state from PENDING_OPEN to OPENED, but when > trying to delete the RIT zk node, it found out this zk node was in a illeagl > state OFFLINE, which was set in 6. So the RIT node can't be deleted. Now, the > inconsistent case appeared. For this region, its state shows OPENED, but in > RIT node, its state shows OFFLINE > {noformat} > 8. 2016-11-22 10:17:32,563 INFO [AM.ZK.Worker-pool2-t3] master.RegionStates: > Transition {7e9aee32eb98a6fc9d503b99fc5f9615 state=PENDING_OPEN, > ts=1479781052546, server=example.org,30003,1479780976834} to > {7e9aee32eb98a6fc9d503b99fc5f9615 state=OPEN, ts=1479781052563, > server=example.org,30003,1479780976834} > 9. 2016-11-22 10:17:32,595 WARN [AM.ZK.Worker-pool2-t3] zookeeper.ZKAssign: > master:30001-0x15810b5f633015e, > quorum=hbase4dev04.et2sqa:2181,hbase4dev05.et2sqa:2181,hbase4dev06.et2sqa:2181, > baseZNode=/test-hbase11-func2 Attempting to delete unassigned node > 7e9aee32eb98a6fc9d503b99fc5f9615 in RS_ZK_REGION_OPENED state but node is in > M_ZK_REGION_OFFLINE state > 10. 2016-11-22 10:17:32,595 WARN [AM.ZK.Worker-pool2-t12] > zookeeper.ZKAssign: master:30001-0x15810b5f633015e, > quorum=hbase4dev04.et2sqa:2181,hbase4dev05.et2sqa:2181,hbase4dev06.et2sqa:2181, > baseZNode=/test-hbase11-func2 Attempting to delete unassigned node > 4fb51c279d694f8808739ea632673584 in RS_ZK_REGION_OPENED state but node is in > M_ZK_REGION_OFFLINE state > 11. 2016-11-22 10:17:32,595 ERROR [AM.ZK.Worker-pool2-t3] > coordination.ZkOpenRegionCoordination: Znode of region > 7e9aee32eb98a6fc9d503b99fc5f9615 could not be deleted. > {noformat} > After a while, the second assign's zk event arrived, due to the inconsistent > state(expect PENDING_OPEN but OPENED), master unassign this region, then this > region will remain unassigned > {noformat} > 12. 2016-11-22 10:17:51,305 DEBUG [AM.ZK.Worker-pool2-t20] > master.AssignmentManager: Handling RS_ZK_REGION_OPENED, > server=example.org,30003,1479780976834, > region=7e9aee32eb98a6fc9d503b99fc5f9615, which is more than 15 seconds late, > current_state={7e9aee32eb98a6fc9d503b99fc5f9615 state=OPEN, ts=1479781052563, > server=example.org,30003,1479780976834} > 13. 2016-11-22 10:17:51,305 WARN [AM.ZK.Worker-pool2-t20] > master.AssignmentManager: Received OPENED for > 7e9aee32eb98a6fc9d503b99fc5f9615 from example.org,30003,1479780976834 but the > region isn't PENDING_OPEN/OPENING here: {7e9aee32eb98a6fc9d503b99fc5f9615 > state=OPEN, ts=1479781052563, server=example.org,30003,1479780976834} > 14. 2016-11-22 10:17:51,308 DEBUG [AM.ZK.Worker-pool2-t20] > master.AssignmentManager: Sent CLOSE to example.org,30003,1479780976834 for > region test,P7HQ55,1475985973151.7e9aee32eb98a6fc9d503b99fc5f9615. > {noformat} > Restart master may solve this issue, but due to HBASE-17264 and HBASE-17265, > restart master may still not helping. -- This message was sent by Atlassian JIRA (v6.3.15#6346)