Allan Yang created HBASE-17718:
----------------------------------

             Summary: Difference between RS's servername and its ephemeral node 
cause SSH stop working
                 Key: HBASE-17718
                 URL: https://issues.apache.org/jira/browse/HBASE-17718
             Project: HBase
          Issue Type: Bug
    Affects Versions: 1.1.8, 1.2.4, 2.0.0
            Reporter: Allan Yang
            Assignee: Allan Yang



After HBASE-9593, the RS puts up an ephemeral node in ZK before reporting for 
duty. But if the hosts config (/etc/hosts) differs between the master and the 
RS, the RS's serverName can be different from the one stored in the ephemeral 
ZK node. The email mentioned in HBASE-13753 
(http://mail-archives.apache.org/mod_mbox/hbase-user/201505.mbox/%3CCANZDn9ueFEEuZMx=pZdmtLsdGLyZz=rrm1N6EQvLswYc1z-H=g...@mail.gmail.com%3E)
 is exactly what happened in our production env. 

But what the email didn't point out is that the difference between the 
serverName in the RS and the one in the ZK node can cause SSH to stop working, 
as we can see from the code in {{RegionServerTracker}}:
{code}
  @Override
  public void nodeDeleted(String path) {
    if (path.startsWith(watcher.rsZNode)) {
      String serverName = ZKUtil.getNodeName(path);
      LOG.info("RegionServer ephemeral node deleted, processing expiration [" +
        serverName + "]");
      ServerName sn = ServerName.parseServerName(serverName);
      if (!serverManager.isServerOnline(sn)) {
        LOG.warn(serverName.toString() + " is not online or isn't known to the master."+
         "The latter could be caused by a DNS misconfiguration.");
        return;
      }
      remove(sn);
      this.serverManager.expireServer(sn);
    }
  }
{code}
When the names differ, {{isServerOnline}} returns false for the name parsed 
from the znode, so the method returns early and the server is never processed 
by SSH/ServerCrashProcedure. The regions on this server will not be assigned 
again until the master restarts or fails over.
I know HBASE-9593 was meant to fix the case where an RS reports for duty and 
crashes before it can put up a ZK node. That is a very rare case. But the issue 
I mentioned can happen more often (due to DNS, config, etc.) and has more 
severe consequences.
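
To make the failure concrete, here is a toy model of the master-side check. It is only a sketch, not HBase code: the class {{ServerNameMismatchSketch}}, its methods, and the host names are all made up for illustration, and server names are treated as plain "host,port,startcode" strings.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the master-side check; hypothetical names, not HBase code. */
final class ServerNameMismatchSketch {
    // Server names the master considers online, as "host,port,startcode".
    private final Set<String> onlineServers = new HashSet<>();

    /** The master records the RS under the name the master resolved for it. */
    void reportForDuty(String nameSeenByMaster) {
        onlineServers.add(nameSeenByMaster);
    }

    boolean isServerOnline(String serverName) {
        return onlineServers.contains(serverName);
    }

    /** Returns true only if crash handling would be started for this znode. */
    boolean nodeDeleted(String znodeName) {
        if (!isServerOnline(znodeName)) {
            return false; // same early return as in the quoted nodeDeleted
        }
        onlineServers.remove(znodeName);
        return true;
    }

    public static void main(String[] args) {
        ServerNameMismatchSketch master = new ServerNameMismatchSketch();
        // Assumed example: the master resolves the FQDN, the RS used a short name.
        String seenByMaster = "rs1.example.com,16020,1488000000000";
        String znodeName = "rs1,16020,1488000000000";
        master.reportForDuty(seenByMaster);
        System.out.println("crash handled: " + master.nodeDeleted(znodeName));
        System.out.println("still online:  " + master.isServerOnline(seenByMaster));
    }
}
```

In this model the deleted znode never matches an online server, so the dead RS stays in the online set and nothing triggers reassignment of its regions.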

So here I offer some solutions to discuss:
1. Revert HBASE-9593 from all branches. Andrew Purtell has already reverted it 
in branch-0.98.
2. Abort the RS if the master returns a different name, since otherwise SSH 
can't work properly.
3. Have the master accept whatever servername the RS reports and not change it.
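
As a rough illustration of option 2, the check on the RS side could look something like the sketch below. {{HostnameCheckSketch}} and {{checkOrAbort}} are hypothetical names, not an existing HBase API, and I am assuming the RS has both the host name it used for its znode and the host name the master sent back. A real patch would abort the RS instead of throwing.

```java
/** Sketch of option 2; hypothetical helper, not an existing HBase API. */
final class HostnameCheckSketch {
    /** Throws if the host used for the znode differs from the master's view. */
    static void checkOrAbort(String znodeHost, String hostSeenByMaster) {
        // Host names are case-insensitive, so compare them that way.
        if (!znodeHost.equalsIgnoreCase(hostSeenByMaster)) {
            throw new IllegalStateException("Aborting: znode was created as '"
                + znodeHost + "' but master sees this server as '"
                + hostSeenByMaster + "'");
        }
    }

    public static void main(String[] args) {
        // Same name on both sides: nothing happens.
        checkOrAbort("rs1.example.com", "rs1.example.com");
        try {
            // Short name vs. FQDN: the RS would abort here.
            checkOrAbort("rs1", "rs1.example.com");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing fast like this turns a silent loss of region assignment into a visible RS startup failure that points at the DNS or hosts misconfiguration.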

 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
