I have narrowed it down to the following:

    // Server to handle client requests
    String machineName = DNS.getDefaultHost(conf.get(
        "hbase.regionserver.dns.interface", "default"), conf.get(
        "hbase.regionserver.dns.nameserver", "default"));

I am not using the default interface for the RS; I have changed it to
'eth1'. The machineName is getting set to 'server-2.rfiserve.net.'
Notice the extra period at the end.
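
Whatever the cause of the trailing period turns out to be, the name could
be normalized before it is used. The following is only a minimal sketch of
that idea, not HBase code: the HostNames class and the stripTrailingDot
method are hypothetical names made up for illustration.

```java
public class HostNames {
    /**
     * Hypothetical helper (not part of HBase or Hadoop): removes a single
     * trailing dot from a host name and leaves every other name unchanged.
     */
    static String stripTrailingDot(String host) {
        if (host != null && host.endsWith(".")) {
            return host.substring(0, host.length() - 1);
        }
        return host;
    }

    public static void main(String[] args) {
        // Name with the extra period, as seen in machineName.
        System.out.println(stripTrailingDot("server-2.domain.net."));
        // A name without the period passes through untouched.
        System.out.println(stripTrailingDot("server-2.domain.net"));
    }
}
```

Both calls in main print server-2.domain.net, so the same name would be
used regardless of which form the lookup returned.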

Because of the above, there is an inconsistency between the way ZooKeeper
recorded the regionserver address and the way ServerManager holds it in its
cached list of online servers.
You will notice the extra dot in the ZooKeeper entry but not in the
ServerManager list.

[zk: localhost:2181(CONNECTED) 3] ls /hbase/rs
[server-2.domain.net.,60020,1310684522383,server-1.domain.net.,60020,1310680203359]


In ServerManager we do the following:

void recordNewServer(HServerInfo info, boolean useInfoLoad,
      HRegionInterface hri) {
    HServerLoad load = useInfoLoad? info.getLoad(): new HServerLoad();
    String serverName = info.getServerName();
    LOG.info("Registering server=" + serverName + ", regionCount=" +
      load.getLoad() + ", userLoad=" + useInfoLoad);
    info.setLoad(load);
    // TODO: Why did we update the RS location ourself?  Shouldn't RS do this?
    // masterStatus.getZooKeeper().updateRSLocationGetWatch(info, watcher);
    // -- If I understand the question, the RS does not update the location
    // because could be disagreement over locations because of DNS issues; only
    // master does DNS now -- St.Ack 20100929.
    this.onlineServers.put(serverName, info);
......

In RegionServerTracker, after the node is deleted but before the server is
expired, a map lookup happens. It looks up
server-2.domain.net.,60020,1310684522383 (with the extra period), but the
actual key in the map is server-2.domain.net,60020,1310684522383 (without
the extra period).


  @Override
  public void nodeDeleted(String path) {
    if(path.startsWith(watcher.rsZNode)) {
      String serverName = ZKUtil.getNodeName(path);
      LOG.info("RegionServer ephemeral node deleted, processing expiration [" +
          serverName + "]");
      HServerInfo hsi = serverManager.getServerInfo(serverName);
      if(hsi == null) {
        LOG.info("No HServerInfo found for " + serverName);
        return;
      }
      serverManager.expireServer(hsi);
    }
  }

The lookup fails, so the expiration never happens. I will get back when I
have more details on why DNS returns the name this way.
An interesting question: is it OK to not expire the region server when we
have already deleted the RS entry from ZooKeeper?
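
The failed lookup described above can be reproduced outside HBase with a
plain HashMap. This is a simplified sketch under stated assumptions: the
TrailingDotLookup class, its register/lookup/normalize methods and the
String values are stand-ins I made up; only the host,port,startcode key
format and the two server names come from the output above.

```java
import java.util.HashMap;
import java.util.Map;

public class TrailingDotLookup {
    // Stand-in for ServerManager's onlineServers map. The real map holds
    // HServerInfo values; a String is used here to keep the sketch small.
    private static final Map<String, String> onlineServers = new HashMap<>();

    // Mimics recordNewServer: the key is stored as the master reported it.
    static void register(String serverName) {
        onlineServers.put(serverName, "info for " + serverName);
    }

    // Mimics getServerInfo: exact-match lookup, null when the key is absent.
    static String lookup(String serverName) {
        return onlineServers.get(serverName);
    }

    // Hypothetical fix: drop a trailing dot from the host part of a
    // "host,port,startcode" server name before using it as a key.
    static String normalize(String serverName) {
        int comma = serverName.indexOf(',');
        if (comma <= 0 || serverName.charAt(comma - 1) != '.') {
            return serverName;
        }
        return serverName.substring(0, comma - 1) + serverName.substring(comma);
    }

    public static void main(String[] args) {
        register("server-2.domain.net,60020,1310684522383");

        // Node name as read back from ZooKeeper, with the extra period.
        String fromZk = "server-2.domain.net.,60020,1310684522383";

        System.out.println(lookup(fromZk));            // prints null
        System.out.println(lookup(normalize(fromZk))); // finds the entry
    }
}
```

The first lookup returns null, which corresponds to the "No HServerInfo
found" branch in nodeDeleted; the second succeeds once both sides agree on
the key.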

On Thu, Jul 14, 2011 at 4:32 PM, Shrijeet Paliwal
<shrij...@rocketfuel.com> wrote:

> Hi Everyone,
>
> Hbase Version: 0.90.3
> Hadoop Version: cdh3u0
> 2 region servers, zookeeper quorum managed by hbase.
>
> I was doing some tests and it seemed regions are not getting reassigned by
> master if RS is brought down.
> Here are the steps:
>
> 0. Cluster in a steady state. Pick a random key: k1 belonging to a RS: rs1
> and perform a get from shell. Result comes back fine.
> 1. Bring down rs1 using [/usr/lib/hbase-0.20/bin/hbase-daemon.sh --config
> /usr/lib/hbase-0.20/conf/ stop regionserver]
> 2. Wait a few seconds and do a get from shell for k1 again. k1 is still
> being located at rs1 and RetriesExhaustedException occurs.
> 3. Wait a few minutes and do a get from shell for k1 again. k1 is still
> being located at rs1 and RetriesExhaustedException occurs.
> 4. Bring up rs1 using [/usr/lib/hbase-0.20/bin/hbase-daemon.sh --config
> /usr/lib/hbase-0.20/conf/ start regionserver]
> 5. A get from shell brings back the result just fine.
>
> My hope at step (3) was that the regions would be reassigned and the get
> would succeed. 0.90.2 introduced a process to shut down more gracefully,
> which is great, but a graceful shutdown is not always possible.
> I have pastebin-ed the relevant logs. Can anyone help me understand the
> scenario?
>
> Hbase Shell after RS brought down
> http://pastebin.com/8bvk5RFV
>
> RS log around time it was brought down
> http://pastebin.com/sgVRVCCj
>
> Zkdump after RS brought down
> http://pastebin.com/meyqCVJ0
>
> Hmaster log around time RS was brought down
> http://pastebin.com/jBGKuy74
>
> hbck after RS brought down
> http://pastebin.com/bxvyTTF5
>
> hbck after RS brought up
> http://pastebin.com/FPxvT9qW
>
