I'm trying to test HBASE-5589 -- to see if I can add an API call to
HMasterInterface and do a rolling-restart / upgrade on a live cluster,
which led me down another rabbit hole.

I'm wondering how the rolling-restart.sh script worked in the past (I can
spend more time setting up an older version to test this, but figured I'd
ask).

I'm getting stuck when the bin/rolling-restart.sh tries to wait until the
Master ZNode expires.  In this particular case, the script seems to hang
there forever (even after the /hbase/master ephemeral node expires).

Here's the code in the script:
----
# make sure the master znode has been deleted before continuing
    zparent=`$bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool zookeeper.znode.parent`
    if [ "$zparent" == "null" ]; then zparent="/hbase"; fi
    zmaster=`$bin/hbase org.apache.hadoop.hbase.util.HBaseConfTool zookeeper.znode.master`
    if [ "$zmaster" == "null" ]; then zmaster="master"; fi
    zmaster=$zparent/$zmaster
    echo -n "Waiting for Master ZNode ${zmaster} to expire"
    while bin/hbase zkcli stat $zmaster >/dev/null 2>&1; do
      echo -n "."
      sleep 1
    done
    echo #force a newline
----

The problem is that 'bin/hbase zkcli stat /hbase/master ...' seems to
always return with $? == 0 regardless of whether the znode is present or
not!  I've checked with Patrick Hunt (ZK committer) and this is the
expected behavior.  The only non-zero retcodes are for abnormal exits
(exceptions thrown), so the while loop above never sees a failure and
never terminates.
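
One possible workaround, sketched here rather than tested against a live
cluster, is to stop relying on the exit code and inspect what zkcli prints
instead.  This assumes zkcli reports a missing znode with the text
"Node does not exist"; that message should be confirmed against the ZK
version in use.  The function names znode_missing and
wait_for_znode_expiry are hypothetical helpers, not part of the existing
script.

```shell
# Hypothetical helper: reads zkcli output on stdin and succeeds only if it
# says the znode is gone.
# ASSUMPTION: zkcli prints "Node does not exist" for a missing znode.
znode_missing() {
  grep -q "Node does not exist"
}

# Hypothetical helper: poll until the znode is reported missing.
#   $1      znode path, e.g. /hbase/master
#   $2...   command that runs the ZK CLI, e.g. "$bin/hbase" zkcli
wait_for_znode_expiry() {
  local znode="$1"
  shift
  printf '%s' "Waiting for Master ZNode ${znode} to expire"
  # Decide on the printed output, not on the exit status of zkcli.
  until "$@" stat "$znode" 2>&1 | znode_missing; do
    printf '.'
    sleep 1
  done
  echo # force a newline
}
```

In the script this would be called as
wait_for_znode_expiry "$zmaster" "$bin/hbase" zkcli in place of the
existing while loop.  Matching on message text is fragile if the wording
changes between ZK releases, so this is a stopgap rather than a real fix.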

Here's the ZK code I was looking through:
https://github.com/apache/zookeeper/blob/release-3.4.3/src/java/main/org/apache/zookeeper/ZooKeeperMain.java#L736

https://github.com/apache/zookeeper/blob/release-3.4.3/src/java/main/org/apache/zookeeper/ZooKeeper.java#L980


Thoughts?

Jon.

-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [email protected]
