[ 
https://issues.apache.org/jira/browse/HBASE-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858392#comment-13858392
 ] 

Jean-Marc Spaggiari commented on HBASE-8912:
--------------------------------------------

I tried the patch, and I think it just moved the issue further along :(

First, I restored the default balancer to get normal behaviour.
{code}
2013-12-29 13:20:24,408 FATAL 
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server 
node1.domain.com,60020,1388341141398: Exception refreshing OPENING; 
region=87dc596f763bd1b43a63c4afd93e4f00, context=post_region_open
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
BadVersion for /hbase/unassigned/87dc596f763bd1b43a63c4afd93e4f00
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:115)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)
    at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:349)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:848)
    at 
org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:811)
    at 
org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:747)
    at 
org.apache.hadoop.hbase.zookeeper.ZKAssign.retransitionNodeOpening(ZKAssign.java:674)
    at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tickleOpening(OpenRegionHandler.java:380)
    at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:108)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
2013-12-29 13:20:24,413 FATAL 
org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded 
coprocessors are: []
2013-12-29 13:20:24,420 WARN 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed 
refreshing OPENING; region=87dc596f763bd1b43a63c4afd93e4f00, 
context=post_region_open
2013-12-29 13:20:24,421 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:60020-0x1427652a35a108f Attempt to transition the unassigned node 
for 404a7ac95dc8ce89826206453c501e2a from M_ZK_REGION_OFFLINE to 
RS_ZK_REGION_OPENING failed, the node existed and was in the expected state but 
then when setting data we got a version mismatch
2013-12-29 13:20:24,423 INFO org.mortbay.log: Stopped 
SelectChannelConnector@0.0.0.0:60030
2013-12-29 13:20:24,434 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:60020-0x1427652a35a108f Attempt to transition the unassigned node 
for 87dc596f763bd1b43a63c4afd93e4f00 from RS_ZK_REGION_OPENING to 
RS_ZK_REGION_FAILED_OPEN failed, the node existed but was in the state 
M_ZK_REGION_OFFLINE set by the server node1.domain.com,60020,1388341141398
2013-12-29 13:20:24,435 WARN 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Unable to mark 
region {NAME => 
'page,moc.krowtenrehtaeweht.www\x1Fhttp\x1F-1\x1F/gardening/cask0109\x1Fnull,1379303806726.87dc596f763bd1b43a63c4afd93e4f00.',
 STARTKEY => 
'moc.krowtenrehtaeweht.www\x1Fhttp\x1F-1\x1F/gardening/cask0109\x1Fnull', 
ENDKEY => 'moc.nuhc9.iahgnahs\x1Fhttp\x1F-1\x1F/travels/23865/\x1Fnull', 
ENCODED => 87dc596f763bd1b43a63c4afd93e4f00,} as FAILED_OPEN. It's likely that 
the master already timed out this open attempt, and thus another RS already has 
the region.
2013-12-29 13:20:24,435 ERROR org.apache.hadoop.hbase.executor.EventHandler: 
Caught throwable while processing event M_RS_OPEN_REGION
java.io.IOException: Aborting flush because server is abortted...
    at 
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1556)
    at 
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1539)
    at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1034)
    at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:982)
    at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:947)
    at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.cleanupFailedOpen(OpenRegionHandler.java:365)
    at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:115)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
{code}

It crashed one region server.

I stopped the cluster, restarted it, and then I got one region pending 
transition for more than 5 minutes.

{code}
2013-12-29 13:22:37,716 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:60020-0x34335c5090e04bb Attempt to transition the unassigned node 
for 75c96fb5c15793e04fb71d553a51619b from RS_ZK_REGION_OPENING to 
RS_ZK_REGION_OPENING failed, the node existed but was version 7 not the 
expected version 6
2013-12-29 13:22:37,716 WARN 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed 
refreshing OPENING; region=75c96fb5c15793e04fb71d553a51619b, 
context=post_region_open
2013-12-29 13:22:37,749 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: 
regionserver:60020-0x34335c5090e04bb Attempt to transition the unassigned node 
for 75c96fb5c15793e04fb71d553a51619b from RS_ZK_REGION_OPENING to 
RS_ZK_REGION_FAILED_OPEN failed, the node existed but was in the state 
M_ZK_REGION_OFFLINE set by the server node1.domain.com,60020,1388341328265
2013-12-29 13:22:37,751 WARN 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Unable to mark 
region {NAME => 
'page,ac.edudlicnep.www\x1Fhttp\x1F-1\x1F/s/ref=sr_nr_p_6_4\x1Frh=n%3A1064954%2Ck%3AArt+Supplies%2Cp_6%3AA22378Z03K0GID&bbn=1064954&keywords=Art+Supplies&ie=UTF8&qid=1343415953&rnid=331539011,1384385444837.75c96fb5c15793e04fb71d553a51619b.',
 STARTKEY => 
'ac.edudlicnep.www\x1Fhttp\x1F-1\x1F/s/ref=sr_nr_p_6_4\x1Frh=n%3A1064954%2Ck%3AArt+Supplies%2Cp_6%3AA22378Z03K0GID&bbn=1064954&keywords=Art+Supplies&ie=UTF8&qid=1343415953&rnid=331539011',
 ENDKEY => 
'ac.efilthgin\x1Fhttp\x1F-1\x1F/directory/all/all/all-virtuelle+four-bois+sport+piano+ecrans-geants+europeen+sandwichs+bar-etudiant+desserts+bluegrass+open-bar+jam\x1Fnull',
 ENCODED => 75c96fb5c15793e04fb71d553a51619b,} as FAILED_OPEN. It's likely that 
the master already timed out this open attempt, and thus another RS already has 
the region.
{code}
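
For anyone reading these warnings cold: ZooKeeper's setData takes an expected version and fails with BadVersion when the node has been written since the caller last read it. The sketch below is a toy, in-memory illustration of that check only. It is not HBase or ZooKeeper code, and the class VersionedNode and its exception are hypothetical names made up for the example.

```java
import java.nio.charset.StandardCharsets;

/** Toy stand-in for a znode: data plus a version that increases on every write. */
public class VersionedNode {
    /** Hypothetical exception, playing the role of a BadVersion error. */
    public static class BadVersionException extends Exception {
        public BadVersionException(int expected, int actual) {
            super("node was version " + actual + " not the expected version " + expected);
        }
    }

    private byte[] data;
    private int version = 0;

    public VersionedNode(byte[] initial) {
        this.data = initial.clone();
    }

    public synchronized int getVersion() {
        return version;
    }

    public synchronized byte[] getData() {
        return data.clone();
    }

    /** Writes only if expectedVersion matches the current version; returns the new version. */
    public synchronized int setData(byte[] newData, int expectedVersion)
            throws BadVersionException {
        if (expectedVersion != version) {
            throw new BadVersionException(expectedVersion, version);
        }
        data = newData.clone();
        return ++version;
    }

    private static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        VersionedNode node = new VersionedNode(bytes("OFFLINE"));
        int seenByFirstWriter = node.getVersion(); // first writer reads the version
        try {
            // A second writer updates the node in between, bumping the version.
            node.setData(bytes("OFFLINE"), node.getVersion());
            // The first writer now uses its stale version and is rejected.
            node.setData(bytes("OPENING"), seenByFirstWriter);
        } catch (BadVersionException e) {
            System.out.println("Stale write rejected: " + e.getMessage());
        }
    }
}
```

In this toy model the second write is rejected because the version moved between the read and the write, which is the same shape as the "version 7 not the expected version 6" warning above. Which component did the intervening write in the real cluster is not something the sketch can tell.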

Then I stopped the master again, and this time it went well.

So, just to test with the default balancer, I ran the balancer again and again, 
about every 3 minutes to give it a breather between two balancing runs, and I 
again got a region stuck in transition.

> [0.94] AssignmentManager throws IllegalStateException from PENDING_OPEN to 
> OFFLINE
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-8912
>                 URL: https://issues.apache.org/jira/browse/HBASE-8912
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>            Priority: Critical
>             Fix For: 0.94.16
>
>         Attachments: 8912-0.94-alt2.txt, 8912-0.94.txt, HBase-0.94 #1036 test 
> - testRetrying [Jenkins].html, log.txt, 
> org.apache.hadoop.hbase.catalog.TestMetaReaderEditor-output.txt
>
>
> AM throws this exception which subsequently causes the master to abort: 
> {code}
> java.lang.IllegalStateException: Unexpected state : 
> testRetrying,jjj,1372891751115.9b828792311001062a5ff4b1038fe33b. 
> state=PENDING_OPEN, ts=1372891751912, 
> server=hemera.apache.org,39064,1372891746132 .. Cannot transit it to OFFLINE.
>       at 
> org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1879)
>       at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1688)
>       at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1424)
>       at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1399)
>       at 
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1394)
>       at 
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105)
>       at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>       at java.lang.Thread.run(Thread.java:662)
> {code}
> This exception trace is from the failing test TestMetaReaderEditor which is 
> failing pretty frequently, but looking at the test code, I think this is not 
> a test-only issue, but affects the main code path. 
> https://builds.apache.org/job/HBase-0.94/1036/testReport/junit/org.apache.hadoop.hbase.catalog/TestMetaReaderEditor/testRetrying/



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
