In our system each server has two DNS names associated with it: one always
points to a private address, and the other points to a public or private
address depending on the context.

This issue did not show up in 0.94.x, but is showing up on my new 1.x
cluster.  Basically it goes like this:

1. Regionserver starts up and gets its hostname, which resolves to
`hostA.external` due to our /etc/hosts
2. Regionserver registers itself in zookeeper as `hostA.external`
3. Regionserver reports for duty to the HMaster, which re-resolves the DNS
name and gets `hostA.internal`.
4. HMaster registers server as `hostA.internal`
5. Regionserver receives the RegionServerStartupResponse, which contains
`hostA.internal` and uses that for its RPCs
6. HMaster sees a ZNode with `hostA.external`, so it thinks this is a
regionserver that hasn't checked in yet, and registers it as a second server.

So I think the problem is that step #2 happens before step #5.  You can
clearly see this in the `run()` method of HRegionServer.java.

In 0.94, the `createMyEphemeralNode` function was called within
`handleReportForDutyResponse`.  In 1.x, it happens within `run()` BEFORE
`handleReportForDutyResponse`.
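
To make the ordering difference concrete, here is a toy model of the two
sequences. It is not HBase code: the class, the `simulate` method, and the two
sets standing in for ZooKeeper and the HMaster's server list are all
hypothetical, and the hostnames are the ones from the steps above. It only
sketches why creating the znode before the report-for-duty response can leave
the master tracking two names for one server.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Toy model of the registration ordering; not HBase code. All names are hypothetical. */
public class RegistrationOrderSketch {

    /** Returns the names the master ends up tracking for one physical regionserver. */
    static Set<String> simulate(boolean znodeBeforeReport) {
        String localName = "hostA.external";   // what the regionserver resolves for itself
        String masterName = "hostA.internal";  // what the master resolves for the same server

        Set<String> znodes = new LinkedHashSet<>();        // stands in for ephemeral znodes
        Set<String> masterServers = new LinkedHashSet<>(); // stands in for the master's server list

        String nameInUse = localName;
        if (znodeBeforeReport) {
            znodes.add(nameInUse);   // 1.x ordering: znode is created with the local name
        }

        // Report for duty: the master records its own view of the name and sends it back.
        masterServers.add(masterName);
        nameInUse = masterName;

        if (!znodeBeforeReport) {
            znodes.add(nameInUse);   // 0.94 ordering: znode is created with the master's name
        }

        // The master treats any znode name it does not already know as another server.
        masterServers.addAll(znodes);
        return masterServers;
    }

    public static void main(String[] args) {
        System.out.println("znode before report: " + simulate(true));
        System.out.println("znode after report:  " + simulate(false));
    }
}
```

In this model, `simulate(true)` returns both `hostA.internal` and
`hostA.external`, while `simulate(false)` returns only `hostA.internal`.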


I can work around this by handling /etc/hosts specially for my
regionservers.  We have our /etc/hosts file set up like this for a reason,
but I think I can special case regionservers.

However, it seems like a bug that there are mechanisms built in for the
HMaster to determine the RegionServer hostname, but that these mechanisms
do not account for regionservers being registered twice because the name in
ZooKeeper does not match the name the HMaster resolved.

I tried to create a JIRA for this, but either my username no longer has
permission to create issues, or I can't find the place to create them
anymore.  Any help?
https://issues.apache.org/jira/secure/ViewProfile.jspa?name=bbeaudreault
