[ https://issues.apache.org/jira/browse/HDFS-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169702#comment-13169702 ]

Todd Lipcon commented on HDFS-2185:
-----------------------------------

Here's a design sketch -- I have only done a little bit of implementation, 
nothing really fleshed out yet. So, it might change a bit during the course of 
implementation. But feedback on the general approach would be appreciated!

h3. Goals
- Ensure that only a single NN can be active at a time.
-- Use ZK as a lock manager to satisfy this requirement.
- Perform health monitoring of the active NN to trigger a failover should it 
become unhealthy.
- Automatically fail over in the case that one of the hosts fails (e.g. a 
power or network outage).
- Allow manual (administratively initiated) graceful failover.
- Initiate fencing of the previously active NN in the case of non-graceful 
failovers.

h3. Overall design

The ZooKeeper FailoverController (ZKFC) is a separate process/program which 
runs next to each of the HA NameNodes in the cluster. It does not directly 
spawn/supervise the NN JVM process, but rather runs on the same machine and 
communicates with it via localhost RPC.
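
For concreteness, that localhost RPC interface might look roughly like the 
following. Only {{transitionToActive}}/{{transitionToStandby}} are referenced 
in this design; {{monitorHealth()}} is a hypothetical name for the health-check 
RPC, not a settled API:

{code:java}
import java.io.IOException;

/**
 * Sketch of the localhost RPC interface (the HAServiceProtocol mentioned
 * below) that the ZKFC calls on its co-located NN. transitionToActive and
 * transitionToStandby appear in this design; monitorHealth() is only a
 * placeholder name for the health-check RPC.
 */
public interface HAServiceProtocol {
  /** Ask the local NN to start performing active duty. */
  void transitionToActive() throws IOException;

  /** Ask the local NN to drop back to standby. */
  void transitionToStandby() throws IOException;

  /** Throws an exception if the local NN considers itself unhealthy. */
  void monitorHealth() throws IOException;
}
{code}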

The ZKFC is designed to be as simple as possible to decrease the likelihood of 
bugs which might trigger a false failover. It is also designed to use only a 
very small amount of memory, so that it will never have lengthy GC pauses. 
This allows us to set a fairly low timeout on the ZK session in order to 
detect machine failures quickly.

h3. Configuration

The ZKFC needs the following pieces of configuration:
- the list of ZooKeeper servers making up the ZK quorum (fail to start if this 
is not provided)
- the host/port for the HAServiceProtocol of the local NN (defaults to 
localhost:<well-known port>)
- the "base znode" at which to root all of the znodes used by the process

h3. Nodes in ZK

Everything should be rooted at the configured base znode. Within that, there 
should be a znode per nameservice ID. Within this {{/base/nameserviceId/}} 
directory, there are the following znodes:

- {{activeLock}} - an ephemeral node taken by the ZKFC before it asks its local 
NN to become active. This acts as a mutex on the active state and also as a 
failure detector.
- {{activeNodeInfo}} - a non-ephemeral node written by the ZKFC after it 
succeeds in taking {{activeLock}}. This should contain data such as the IPC 
address, HTTP address, etc. of the NN.

The {{activeNodeInfo}} is non-ephemeral so that, when a new NN takes over from 
a failed one, it has enough information to fence the previous active in case 
it's still actually running.
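
Given this layout, taking the lock with the plain ZooKeeper client API could 
look roughly like the following (error handling and retries elided):

{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

class ActiveLock {
  // e.g. /base/nameserviceId/activeLock
  private final String lockPath;
  private final ZooKeeper zk;

  ActiveLock(ZooKeeper zk, String baseZnode, String nameserviceId) {
    this.zk = zk;
    this.lockPath = baseZnode + "/" + nameserviceId + "/activeLock";
  }

  /**
   * Try to take the mutex on the active state. The node is EPHEMERAL, so
   * it disappears automatically if our ZK session dies -- which is what
   * lets it double as a failure detector.
   */
  boolean tryAcquire(byte[] ourInfo)
      throws KeeperException, InterruptedException {
    try {
      zk.create(lockPath, ourInfo, Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      return true;
    } catch (KeeperException.NodeExistsException e) {
      return false; // someone else is (or was just) active
    }
  }
}
{code}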

h3. Runtime operation states

For simplicity of testing, we can model the ZKFC as a state machine.

h4. LOCAL_NOT_READY

The NN on the local host is down or not responding to RPCs. We start in this 
state.

h4. LOCAL_STANDBY

The NN on the local host is in standby mode and ready to automatically 
transition to active if the former active dies.

h4. LOCAL_ACTIVE

The NN on the local host is running and performing active duty.
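
In code, these top-level states could simply be an enum:

{code:java}
/** The three top-level states of the ZKFC state machine (sketch). */
enum State {
  LOCAL_NOT_READY,  // local NN down or not responding; initial state
  LOCAL_STANDBY,    // local NN healthy, ready to take over if active dies
  LOCAL_ACTIVE      // local NN is the current active
}
{code}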

h3. Inputs into state machine

Three other classes interact with the state machine:

h4. ZK Controller

A ZK thread connects to ZK and watches for the following events:
- The previously active master has lost its ephemeral node
- The ZK session is lost
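
A rough sketch of that watcher; it only routes events, with the real handling 
living in the state machine:

{code:java}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

class ZKController implements Watcher {
  @Override
  public void process(WatchedEvent event) {
    if (event.getState() == Event.KeeperState.Expired) {
      // Our own ZK session died: the state machine should reconnect.
      onSessionLost();
    } else if (event.getType() == Event.EventType.NodeDeleted) {
      // The active's ephemeral activeLock is gone: candidate trigger
      // for an automatic failover.
      onActiveLockDeleted(event.getPath());
    }
  }

  private void onSessionLost() { /* hand off to the state machine */ }
  private void onActiveLockDeleted(String path) { /* hand off */ }
}
{code}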

h4. User-initiated failover controller

By some means (RPC, signal, HTTP, etc.), the user can request that the active 
NN's FC gracefully turn over the active state to a different NN.

h4. Health monitor

A HealthMonitor thread heartbeats continuously to the local NN. It provides an 
event whenever the health state of the NN changes. For example:
- NN has become unhealthy
- Lost contact with NN
- NN is now healthy
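
A sketch of that heartbeat loop, reusing the hypothetical {{monitorHealth()}} 
RPC from the interface sketch above:

{code:java}
import java.io.IOException;

/** Heartbeats the local NN and reports health transitions as events. */
class HealthMonitor implements Runnable {
  interface Callback {
    void becameHealthy();
    void becameUnhealthy();
  }

  private final HAServiceProtocol localNN; // localhost RPC proxy
  private final Callback cb;
  private final long intervalMs;
  private boolean healthy = false;

  HealthMonitor(HAServiceProtocol localNN, Callback cb, long intervalMs) {
    this.localNN = localNN;
    this.cb = cb;
    this.intervalMs = intervalMs;
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        localNN.monitorHealth(); // throws if the NN is unhealthy
        if (!healthy) { healthy = true; cb.becameHealthy(); }
      } catch (IOException e) {
        // A fuller version would distinguish "lost contact with NN"
        // (RPC failure) from "NN reports unhealthy" here.
        if (healthy) { healthy = false; cb.becameUnhealthy(); }
      }
      try {
        Thread.sleep(intervalMs);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    }
  }
}
{code}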

h3. Behavior of state machine

h4. LOCAL_NOT_READY
- System starts here
- When HealthMonitor indicates the local NN is healthy:
-- Transition to LOCAL_STANDBY

h4. LOCAL_STANDBY
- On health state change:
-- Transition to LOCAL_NOT_READY if the local NN goes down
- On ZK state change:
-- If the old active's {{activeLock}} node was deleted, try to initiate an 
automatic failover
-- If our own ZK session died, reconnect to ZK

h4. Failover process
- Try to create the {{activeLock}} ephemeral node in ZK
- If we are unsuccessful, return to LOCAL_STANDBY
- See if there is an {{activeNodeInfo}} node in ZK. If so:
-- The old NN may still be running (it didn't gracefully shut down).
-- Initiate the fencing process.
-- If successful, delete the {{activeNodeInfo}} node in ZK.
- Create an {{activeNodeInfo}} node with our own information (i.e. NN IPC 
address, etc)
- Send an IPC to the local NN to {{transitionToActive}}. If successful, go to 
LOCAL_ACTIVE. (The whole sequence is sketched below.)
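
Putting the steps together, the failover attempt might look roughly like this 
(fencing itself is out of scope for the sketch):

{code:java}
import java.io.IOException;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

class FailoverAttempt {
  /** Returns true if we became active, false if we stay in standby. */
  boolean tryBecomeActive(ZooKeeper zk, String dir, byte[] ourInfo,
                          HAServiceProtocol localNN)
      throws KeeperException, InterruptedException, IOException {
    // 1. Try to take the ephemeral activeLock.
    try {
      zk.create(dir + "/activeLock", ourInfo,
          Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    } catch (KeeperException.NodeExistsException e) {
      return false; // lost the race; back to LOCAL_STANDBY
    }

    // 2. If activeNodeInfo is still there, the old active didn't shut
    //    down gracefully: fence it before proceeding.
    String infoPath = dir + "/activeNodeInfo";
    if (zk.exists(infoPath, false) != null) {
      byte[] oldInfo = zk.getData(infoPath, false, null);
      fence(oldInfo); // out of scope for this sketch
      zk.delete(infoPath, -1);
    }

    // 3. Record our own info (non-ephemeral) so a successor can fence us.
    zk.create(infoPath, ourInfo, Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // 4. Finally ask the local NN to become active.
    localNN.transitionToActive();
    return true;
  }

  private void fence(byte[] oldActiveInfo) { /* placeholder */ }
}
{code}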

h4. LOCAL_ACTIVE
- On health state change to unhealthy:
-- Delete our active lock znode and go to LOCAL_NOT_READY. Another node will 
fence us.
- On ZK connection loss, or on noticing that our znode got deleted:
-- Another process is probably about to fence us... unless all nodes lost 
their connections, in which case we should "stay the course" and stay active 
here. This still needs to be figured out.
- On an administrative request to fail over (sketched below):
-- IPC to the local NN to transition to standby
-- Delete our active lock znode
-- Transition to LOCAL_STANDBY with some timeout set, allowing another 
failover controller to take the lock
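
A sketch of that graceful hand-off. Deleting {{activeNodeInfo}} on this path 
is my assumption (the steps above only mention the lock), so that a successor 
doesn't see stale info and try to fence a node that shut down cleanly:

{code:java}
import java.io.IOException;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

class GracefulFailover {
  /** Handle an administrative request to give up the active state. */
  void cedeActive(ZooKeeper zk, String dir, HAServiceProtocol localNN)
      throws KeeperException, InterruptedException, IOException {
    // 1. The local NN stops acting as active first, so that two NNs are
    //    never active at once.
    localNN.transitionToStandby();

    // 2. Graceful path: remove activeNodeInfo ourselves (assumption, see
    //    lead-in) so our successor skips fencing, then release the lock.
    zk.delete(dir + "/activeNodeInfo", -1);
    zk.delete(dir + "/activeLock", -1);

    // 3. Back to LOCAL_STANDBY; per the design, a timeout should be set
    //    before re-contending, to let another ZKFC take the lock.
  }
}
{code}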
                
> HA: ZK-based FailoverController
> -------------------------------
>
>                 Key: HDFS-2185
>                 URL: https://issues.apache.org/jira/browse/HDFS-2185
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Eli Collins
>            Assignee: Todd Lipcon
>
> This jira is for a ZK-based FailoverController daemon. The FailoverController 
> is a separate daemon from the NN that does the following:
> * Initiates leader election (via ZK) when necessary
> * Performs health monitoring (aka failure detection)
> * Performs fail-over (standby to active and active to standby transitions)
> * Heartbeats to ensure liveness
> It should have the same/similar interface as the Linux HA RM to aid 
> pluggability.
