CatalogTracker.waitForMeta can wait forever and totally stall a RS
------------------------------------------------------------------

                 Key: HBASE-3668
                 URL: https://issues.apache.org/jira/browse/HBASE-3668
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.90.1
            Reporter: Jean-Daniel Cryans
            Assignee: Jean-Daniel Cryans
            Priority: Blocker
             Fix For: 0.90.2


We got into a weird situation after hitting HBASE-3664 leaving almost half the 
region servers unable to open regions (we didn't kill the master but all RS 
were restarted).

Here's the relevant jstack from the region servers:

{code}

"regionserver60020-EventThread" daemon prio=10 tid=0x0000000041297800 nid=0x5c5 
in Object.wait() [0x00007fbd2b7f6000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00007fbd4e1ac6d0> (a 
java.util.concurrent.atomic.AtomicBoolean)
        at 
org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:328)
        - locked <0x00007fbd4e1ac6d0> (a 
java.util.concurrent.atomic.AtomicBoolean)
        at 
org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:363)
        at 
org.apache.hadoop.hbase.zookeeper.MetaNodeTracker.nodeDeleted(MetaNodeTracker.java:64)
        at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.nodeCreated(ZooKeeperNodeTracker.java:154)
        - locked <0x00007fbd4d7ab060> (a 
org.apache.hadoop.hbase.zookeeper.MetaNodeTracker)
        at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.nodeDataChanged(ZooKeeperNodeTracker.java:179)
        - locked <0x00007fbd4d7ab060> (a 
org.apache.hadoop.hbase.zookeeper.MetaNodeTracker)
        at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:268)
        at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
{code}

Every server in that state cannot receive any ZK event. See how we first do a 
nodeDataChanged, then nodeCreated, then nodeDeleted. This is correlated in the 
logs:

{code}
2011-03-17 17:57:03,636 DEBUG 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: 
regionserver:60020-0x12d627b723e081f Received ZooKeeper Event, 
type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/1028785192
2011-03-17 17:57:03,636 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
regionserver:60020-0x12d627b723e081f Unable to get data of znode 
/hbase/unassigned/1028785192 because node does not exist (not an error)
2011-03-17 17:57:03,637 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: 
regionserver:60020-0x12d627b723e081f Set watcher on existing znode 
/hbase/unassigned/1028785192
2011-03-17 17:57:03,637 INFO org.apache.hadoop.hbase.zookeeper.MetaNodeTracker: 
Detected completed assignment of META, notifying catalog tracker
2011-03-17 17:57:06,988 DEBUG 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: 
Lookedup root region location, connection=org.
{code}

Node is updated, then is missing, than came back! The reason we're stuck is 
that CT.waitForMeta will wait forever:
{code}
     if (getMetaServerConnection(true) != null) {
        return metaLocation;
      }
      while(!stopped && !metaAvailable.get() &&
          (timeout == 0 || System.currentTimeMillis() < stop)) {
        metaAvailable.wait(timeout);
      }
{code}

So when it tried getMetaServerConnection the first time it didn't work, and 
then since the timeout is 0 then it waits. Even if it was looping, 
metaAvailable will never be true since we're already inside the event thread 
thus blocking any other event!

This is what happens when the RS then tries to update .META. after opening a 
region:

{code}
2011-03-17 17:59:06,557 DEBUG 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Interrupting 
thread Thread[PostOpenDeployTasks:f06b630822a161d0a8fe37481d851d05,5,main]
2011-03-17 17:59:06,557 WARN 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Exception 
running postOpenDeployTasks; region=f06b630822a161d0a8fe37481d851d05
org.apache.hadoop.hbase.NotAllMetaRegionsOnlineException: Interrupted
        at 
org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:365)
        at 
org.apache.hadoop.hbase.catalog.MetaEditor.updateRegionLocation(MetaEditor.java:142)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1359)

{code}

The master eventually tries to assign the region somewhere else where it may 
work, but then the balancer will screw everything up again.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to