[ https://issues.apache.org/jira/browse/HBASE-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858392#comment-13858392 ]
Jean-Marc Spaggiari commented on HBASE-8912:
--------------------------------------------

I tried the patch, and I think that it just moved the issue further :(

First, I restored default balancer to get normal behaviour.

{code}
2013-12-29 13:20:24,408 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server node1.domain.com,60020,1388341141398: Exception refreshing OPENING; region=87dc596f763bd1b43a63c4afd93e4f00, context=post_region_open
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /hbase/unassigned/87dc596f763bd1b43a63c4afd93e4f00
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:349)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:848)
    at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:811)
    at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:747)
    at org.apache.hadoop.hbase.zookeeper.ZKAssign.retransitionNodeOpening(ZKAssign.java:674)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tickleOpening(OpenRegionHandler.java:380)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
2013-12-29 13:20:24,413 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2013-12-29 13:20:24,420 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed refreshing OPENING; region=87dc596f763bd1b43a63c4afd93e4f00, context=post_region_open
2013-12-29 13:20:24,421 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1427652a35a108f Attempt to transition the unassigned node for 404a7ac95dc8ce89826206453c501e2a from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed and was in the expected state but then when setting data we got a version mismatch
2013-12-29 13:20:24,423 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:60030
2013-12-29 13:20:24,434 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x1427652a35a108f Attempt to transition the unassigned node for 87dc596f763bd1b43a63c4afd93e4f00 from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN failed, the node existed but was in the state M_ZK_REGION_OFFLINE set by the server node1.domain.com,60020,1388341141398
2013-12-29 13:20:24,435 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Unable to mark region {NAME => 'page,moc.krowtenrehtaeweht.www\x1Fhttp\x1F-1\x1F/gardening/cask0109\x1Fnull,1379303806726.87dc596f763bd1b43a63c4afd93e4f00.', STARTKEY => 'moc.krowtenrehtaeweht.www\x1Fhttp\x1F-1\x1F/gardening/cask0109\x1Fnull', ENDKEY => 'moc.nuhc9.iahgnahs\x1Fhttp\x1F-1\x1F/travels/23865/\x1Fnull', ENCODED => 87dc596f763bd1b43a63c4afd93e4f00,} as FAILED_OPEN. It's likely that the master already timed out this open attempt, and thus another RS already has the region.
2013-12-29 13:20:24,435 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_RS_OPEN_REGION
java.io.IOException: Aborting flush because server is abortted...
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1556)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1539)
    at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1034)
    at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:982)
    at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:947)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.cleanupFailedOpen(OpenRegionHandler.java:365)
    at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:115)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
{code}

It crashed one region server. I stopped the cluster, restarted it, and then I got one region pending in transition for more than 5 minutes.

{code}
2013-12-29 13:22:37,716 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x34335c5090e04bb Attempt to transition the unassigned node for 75c96fb5c15793e04fb71d553a51619b from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING failed, the node existed but was version 7 not the expected version 6
2013-12-29 13:22:37,716 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed refreshing OPENING; region=75c96fb5c15793e04fb71d553a51619b, context=post_region_open
2013-12-29 13:22:37,749 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x34335c5090e04bb Attempt to transition the unassigned node for 75c96fb5c15793e04fb71d553a51619b from RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN failed, the node existed but was in the state M_ZK_REGION_OFFLINE set by the server node1.domain.com,60020,1388341328265
2013-12-29 13:22:37,751 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Unable to mark region {NAME => 'page,ac.edudlicnep.www\x1Fhttp\x1F-1\x1F/s/ref=sr_nr_p_6_4\x1Frh=n%3A1064954%2Ck%3AArt+Supplies%2Cp_6%3AA22378Z03K0GID&bbn=1064954&keywords=Art+Supplies&ie=UTF8&qid=1343415953&rnid=331539011,1384385444837.75c96fb5c15793e04fb71d553a51619b.', STARTKEY => 'ac.edudlicnep.www\x1Fhttp\x1F-1\x1F/s/ref=sr_nr_p_6_4\x1Frh=n%3A1064954%2Ck%3AArt+Supplies%2Cp_6%3AA22378Z03K0GID&bbn=1064954&keywords=Art+Supplies&ie=UTF8&qid=1343415953&rnid=331539011', ENDKEY => 'ac.efilthgin\x1Fhttp\x1F-1\x1F/directory/all/all/all-virtuelle+four-bois+sport+piano+ecrans-geants+europeen+sandwichs+bar-etudiant+desserts+bluegrass+open-bar+jam\x1Fnull', ENCODED => 75c96fb5c15793e04fb71d553a51619b,} as FAILED_OPEN. It's likely that the master already timed out this open attempt, and thus another RS already has the region.
{code}

Then I stopped the master again, and this time it went well.

So, just to test with the default balancer, I ran the balancer again and again, about every 3 minutes to give it a breather between two runs, and I again got a region stuck in transition.

> [0.94] AssignmentManager throws IllegalStateException from PENDING_OPEN to OFFLINE
> ----------------------------------------------------------------------------------
>
> Key: HBASE-8912
> URL: https://issues.apache.org/jira/browse/HBASE-8912
> Project: HBase
> Issue Type: Bug
> Reporter: Enis Soztutar
> Priority: Critical
> Fix For: 0.94.16
>
> Attachments: 8912-0.94-alt2.txt, 8912-0.94.txt, HBase-0.94 #1036 test - testRetrying [Jenkins].html, log.txt, org.apache.hadoop.hbase.catalog.TestMetaReaderEditor-output.txt
>
>
> AM throws this exception which subsequently causes the master to abort:
> {code}
> java.lang.IllegalStateException: Unexpected state : testRetrying,jjj,1372891751115.9b828792311001062a5ff4b1038fe33b. state=PENDING_OPEN, ts=1372891751912, server=hemera.apache.org,39064,1372891746132 .. Cannot transit it to OFFLINE.
>     at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1879)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1688)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1424)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1399)
>     at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1394)
>     at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>     at java.lang.Thread.run(Thread.java:662)
> {code}
> This exception trace is from the failing test TestMetaReaderEditor which is failing pretty frequently, but looking at the test code, I think this is not a test-only issue, but affects the main code path.
> https://builds.apache.org/job/HBase-0.94/1036/testReport/junit/org.apache.hadoop.hbase.catalog/TestMetaReaderEditor/testRetrying/

--
This message was sent by Atlassian JIRA (v6.1.5#6160)