ZKW.createUnassignedRegion doesn't make sure existing znode is in the right 
state
---------------------------------------------------------------------------------

                 Key: HBASE-2781
                 URL: https://issues.apache.org/jira/browse/HBASE-2781
             Project: HBase
          Issue Type: Bug
            Reporter: Jean-Daniel Cryans
            Assignee: Karthik Ranganathan
            Priority: Critical
             Fix For: 0.21.0


In ZKW.createUnassignedRegion I see this comment:

{code}
      // check if this node already exists - 
      //   - it should not exist
      //   - if it does, it should be in the CLOSED state
{code}

And what I got is:

{noformat}
2010-06-23 15:42:05,823 INFO  [IPC Server handler 3 on 60362] 
master.ServerManager(457): Processing MSG_REPORT_PROCESS_OPEN: 
test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464. from 
h136.sfo.stumble.net,60365,1277332849712; 1 of 4
2010-06-23 15:42:05,867 INFO  [RegionServer:1.worker] 
regionserver.HRegionServer$Worker(1338): Worker: MSG_REGION_OPEN: 
test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464.
2010-06-23 15:42:05,870 DEBUG [RegionServer:1.worker] 
regionserver.RSZookeeperUpdater(157): Updating ZNode 
/1/UNASSIGNED/13bef4950ac6827ac32d87682b8b2464 with [RS2ZK_REGION_OPENING] 
expected version = 0
2010-06-23 15:42:05,871 DEBUG [main-EventThread] master.HMaster(1158): Event 
NodeDataChanged with state SyncConnected with path 
/1/UNASSIGNED/13bef4950ac6827ac32d87682b8b2464
2010-06-23 15:42:05,871 DEBUG [main-EventThread] 
master.ZKMasterAddressWatcher(64): Got event NodeDataChanged with path 
/1/UNASSIGNED/13bef4950ac6827ac32d87682b8b2464
2010-06-23 15:42:05,871 DEBUG [main-EventThread] 
master.ZKUnassignedWatcher(95): ZK-EVENT-PROCESS: Got zkEvent NodeDataChanged 
state:SyncConnected path:/1/UNASSIGNED/13bef4950ac6827ac32d87682b8b2464
2010-06-23 15:42:05,872 INFO  [main-EventThread] 
regionserver.HRegionServer(379): Got ZooKeeper event, state: SyncConnected, 
type: NodeDataChanged, path: /1/UNASSIGNED/13bef4950ac6827ac32d87682b8b2464
2010-06-23 15:42:05,872 DEBUG [MASTER_OPENREGION-10.10.1.136:60362-1] 
handler.MasterOpenRegionHandler(77): Event = RS2ZK_REGION_OPENING, region = 
13bef4950ac6827ac32d87682b8b2464
2010-06-23 15:42:05,874 DEBUG [RegionServer:1.worker] 
regionserver.HRegion(297): Creating region 
test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464.
2010-06-23 15:42:06,154 INFO  [RegionServer:1.worker] 
regionserver.HRegion(366): Onlined 
test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464.; next sequenceid=1
2010-06-23 15:42:06,154 DEBUG [RegionServer:1.worker] 
regionserver.RSZookeeperUpdater(157): Updating ZNode 
/1/UNASSIGNED/13bef4950ac6827ac32d87682b8b2464 with [RS2ZK_REGION_OPENED] 
expected version = 1\
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /1/UNASSIGNED/13bef4950ac6827ac32d87682b8b2464
2010-06-23 15:42:06,249 ERROR [RegionServer:1.worker] 
regionserver.HRegionServer(1488): Failed to mark region 
test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464. as opened
java.io.IOException: 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /1/UNASSIGNED/13bef4950ac6827ac32d87682b8b2464
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: 
KeeperErrorCode = ConnectionLoss for 
/1/UNASSIGNED/13bef4950ac6827ac32d87682b8b2464
2010-06-23 15:42:06,993 DEBUG [RegionServer:1] 
regionserver.HRegionServer(1569): closing region 
test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464.
2010-06-23 15:42:06,993 DEBUG [RegionServer:1] regionserver.HRegion(487): 
Closing test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464.: disabling 
compactions & flushes
2010-06-23 15:42:06,993 DEBUG [RegionServer:1] regionserver.HRegion(512): 
Updates disabled for region, no outstanding scanners on 
test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464.
2010-06-23 15:42:06,993 DEBUG [RegionServer:1] regionserver.HRegion(519): No 
more row locks outstanding on region 
test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464.
2010-06-23 15:42:06,994 INFO  [RegionServer:1] regionserver.HRegion(531): 
Closed test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464.
2010-06-23 15:42:09,105 INFO  [master] master.ProcessServerShutdown(126): 
Region test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464. was in 
transition 
name=test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464., 
state=PENDING_OPEN on dead server h136.sfo.stumble.net,60365,1277332849712 - 
marking unassigned
2010-06-23 15:42:10,065 INFO  [IPC Server handler 2 on 60362] 
master.RegionManager(340): Assigning region 
test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464. to 
h136.sfo.stumble.net,60363,1277332849671
2010-06-23 15:42:10,067 DEBUG [IPC Server handler 2 on 60362] 
zookeeper.ZooKeeperWrapper(1079): While creating UNASSIGNED region 
13bef4950ac6827ac32d87682b8b2464 exists, state = RS2ZK_REGION_OPENING
2010-06-23 15:42:10,126 WARN  [IPC Server handler 2 on 60362] 
zookeeper.ZooKeeperWrapper(1024): 
<localhost:/1,org.apache.hadoop.hbase.master.HMaster>Failed to create ZNode 
/1/UNASSIGNED/13bef4950ac6827ac32d87682b8b2464 in ZooKeeper
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
NodeExists for /1/UNASSIGNED/13bef4950ac6827ac32d87682b8b2464
2010-06-23 15:42:10,127 DEBUG [IPC Server handler 2 on 60362] 
master.RegionManager(350): Created UNASSIGNED zNode 
test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464. in state 
M2ZK_REGION_OFFLINE
2010-06-23 15:42:10,245 INFO  [RegionServer:0] regionserver.HRegionServer(511): 
MSG_REGION_OPEN: test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464.
2010-06-23 15:42:11,248 INFO  [IPC Server handler 1 on 60362] 
master.ServerManager(457): Processing MSG_REPORT_PROCESS_OPEN: 
test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464. from 
h136.sfo.stumble.net,60363,1277332849671; 7 of 13
2010-06-23 15:42:13,795 INFO  [RegionServer:0.worker] 
regionserver.HRegionServer$Worker(1338): Worker: MSG_REGION_OPEN: 
test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464.
2010-06-23 15:42:13,797 ERROR [RegionServer:0.worker] 
regionserver.RSZookeeperUpdater(107): ZNode 
/1/UNASSIGNED/13bef4950ac6827ac32d87682b8b2464 is not in CLOSED/OFFLINE state 
(state = RS2ZK_REGION_OPENING), will NOT open region.
2010-06-23 15:42:13,798 ERROR [RegionServer:0.worker] 
regionserver.HRegionServer(814): Error opening 
test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464.
java.io.IOException: ZNode /1/UNASSIGNED/13bef4950ac6827ac32d87682b8b2464 is 
not in CLOSED/OFFLINE state (state = RS2ZK_REGION_OPENING), will NOT open 
region.
2010-06-23 15:42:13,800 ERROR [RegionServer:0.worker] 
regionserver.RSZookeeperUpdater(141): Aborting open of region 
13bef4950ac6827ac32d87682b8b2464
2010-06-23 15:42:13,800 DEBUG [RegionServer:0.worker] 
regionserver.RSZookeeperUpdater(157): Updating ZNode 
/1/UNASSIGNED/13bef4950ac6827ac32d87682b8b2464 with [RS2ZK_REGION_CLOSED] 
expected version = 0
org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = 
BadVersion for /1/UNASSIGNED/13bef4950ac6827ac32d87682b8b2464
2010-06-23 15:42:13,802 ERROR [RegionServer:0.worker] 
regionserver.HRegionServer(1473): Failed to abort open region 
test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464.
java.io.IOException: org.apache.zookeeper.KeeperException$BadVersionException: 
KeeperErrorCode = BadVersion for /1/UNASSIGNED/13bef4950ac6827ac32d87682b8b2464
Caused by: org.apache.zookeeper.KeeperException$BadVersionException: 
KeeperErrorCode = BadVersion for /1/UNASSIGNED/13bef4950ac6827ac32d87682b8b2464
{noformat}

Basically:
 # A region server was opening the region
 # It was expired just before reporting that the region is opened, leaving the 
znode in the state RS2ZK_REGION_OPENING
 # The region gets reassigned, it sees that state, doesn't change it, but still 
outputs in the end "Created UNASSIGNED zNode 
test,lll,1277332918248.13bef4950ac6827ac32d87682b8b2464. in state 
M2ZK_REGION_OFFLINE"
 # When the region server opens the region, it sees that the state is wrong and 
aborts opening the region

I think that the way to fix it is to change the state to what it should be.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to