[ https://issues.apache.org/jira/browse/HDFS-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169702#comment-13169702 ]
Todd Lipcon commented on HDFS-2185:
-----------------------------------

Here's a design sketch -- I have only done a little bit of implementation, and nothing is really fleshed out yet, so it may change during the course of implementation. Feedback on the general approach would be appreciated!

h3. Goals

- Ensure that only a single NN can be active at a time.
-- Use ZK as a lock manager to satisfy this requirement.
- Perform health monitoring of the active NN, to trigger a failover should it become unhealthy.
- Automatically fail over in the case that one of the hosts fails (eg a power or network outage).
- Allow manual (administratively initiated) graceful failover.
- Initiate fencing of the previously active NN in the case of non-graceful failovers.

h3. Overall design

The ZooKeeper FailoverController (ZKFC) is a separate process/program which runs next to each of the HA NameNodes in the cluster. It does not directly spawn or supervise the NN JVM process; rather, it runs on the same machine and communicates with the NN via localhost RPC.

The ZKFC is designed to be as simple as possible, to decrease the likelihood of bugs which might trigger a false failover. It is also designed to use only a very small amount of memory, so that it will never have lengthy GC pauses. This allows us to set a fairly low timeout on the ZK session in order to detect machine failures quickly.

h3. Configuration

The ZKFC needs the following pieces of configuration:
- the list of ZooKeeper servers making up the ZK quorum (fail to start if this is not provided)
- the host/port for the HAServiceProtocol of the local NN (defaults to localhost:<well-known port>)
- the "base znode" at which to root all of the znodes used by the process

h3. Nodes in ZK

Everything should be rooted at the configured base znode. Within that, there should be a znode per nameservice ID.
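To make the layout concrete, the paths could be composed along these lines (a hypothetical sketch only; the class and method names are illustrative and not from any patch):

```java
// Hypothetical sketch: composing the ZK paths used by the ZKFC.
// The base znode and nameservice ID both come from configuration.
class ZkfcPaths {
    private final String baseZnode;      // configured "base znode", eg "/ha"
    private final String nameserviceId;  // eg "ns1"

    ZkfcPaths(String baseZnode, String nameserviceId) {
        this.baseZnode = baseZnode;
        this.nameserviceId = nameserviceId;
    }

    String nameserviceRoot() { return baseZnode + "/" + nameserviceId; }

    // Ephemeral lock node, taken before asking the local NN to go active.
    String activeLock() { return nameserviceRoot() + "/activeLock"; }

    // Non-ephemeral node holding the active NN's addresses, used for fencing.
    String activeNodeInfo() { return nameserviceRoot() + "/activeNodeInfo"; }
}
```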
Within this {{/base/nameserviceId/}} directory, there are the following znodes:
- {{activeLock}} - an ephemeral node taken by the ZKFC before it asks its local NN to become active. This acts as a mutex on the active state, and also as a failure detector.
- {{activeNodeInfo}} - a non-ephemeral node written by the ZKFC after it succeeds in taking {{activeLock}}. This should hold data such as the IPC address, HTTP address, etc of the NN. The {{activeNodeInfo}} node is non-ephemeral so that, when a new NN takes over from a failed one, it has enough information to fence the previous active in case it's still actually running.

h3. Runtime operation states

For simplicity of testing, we can model the ZKFC as a state machine.

h4. LOCAL_NOT_READY

The NN on the local host is down or not responding to RPCs. We start in this state.

h4. LOCAL_STANDBY

The NN on the local host is in standby mode and ready to automatically transition to active if the current active dies.

h4. LOCAL_ACTIVE

The NN on the local host is running and performing active duty.

h3. Inputs into the state machine

Three other classes interact with the state machine:

h4. ZK controller

A ZK thread connects to ZK and watches for the following events:
- The previously active master has lost its ephemeral node.
- The ZK session is lost.

h4. User-initiated failover controller

By some means (RPC/signal/HTTP/etc) the user can request that the active NN's FC gracefully turn over the active state to a different NN.

h4. Health monitor

A HealthMonitor thread heartbeats continuously to the local NN. It provides an event whenever the health state of the NN changes. For example:
- NN has become unhealthy
- Lost contact with NN
- NN is now healthy

h3. Behavior of the state machine

h4. LOCAL_NOT_READY
- The system starts here.
- When the HealthMonitor indicates the local NN is healthy:
-- Transition to LOCAL_STANDBY.

h4. LOCAL_STANDBY
- On health state change:
-- Transition to LOCAL_NOT_READY if the local NN goes down.
- On ZK state change:
-- If the old ZK "active" node was deleted, try to initiate automatic failover.
-- If our own ZK session died, reconnect to ZK.

h4. Failover process
- Try to create the {{activeLock}} ephemeral node in ZK.
- If we are unsuccessful, return to LOCAL_STANDBY.
- See if there is an {{activeNodeInfo}} node in ZK. If so:
-- The old NN may still be running (it didn't gracefully shut down).
-- Initiate the fencing process.
-- If successful, delete the {{activeNodeInfo}} node in ZK.
- Create an {{activeNodeInfo}} node with our own information (ie NN IPC address, etc).
- Send an IPC to the local NN to transitionToActive. If successful, go to LOCAL_ACTIVE.

h4. LOCAL_ACTIVE
- On health state change to unhealthy:
-- Delete our active lock znode and go to LOCAL_NOT_READY. Another node will fence us.
- On ZK connection loss, or on noticing that our znode got deleted:
-- Another process is probably about to fence us... unless all nodes lost their connection, in which case we should "stay the course" and remain active here. Still need to figure this out.
- On an administrative request to fail over:
-- IPC to the local NN to transition to standby.
-- Delete our active lock znode.
-- Transition to LOCAL_STANDBY with some timeout set, allowing another failover controller to take the lock.

> HA: ZK-based FailoverController
> -------------------------------
>
>          Key: HDFS-2185
>          URL: https://issues.apache.org/jira/browse/HDFS-2185
>      Project: Hadoop HDFS
>   Issue Type: Sub-task
>     Reporter: Eli Collins
>     Assignee: Todd Lipcon
>
> This jira is for a ZK-based FailoverController daemon.
> The FailoverController is a separate daemon from the NN that does the following:
> * Initiates leader election (via ZK) when necessary
> * Performs health monitoring (aka failure detection)
> * Performs failover (standby-to-active and active-to-standby transitions)
> * Heartbeats to ensure liveness
>
> It should have the same/similar interface as the Linux HA RM to aid pluggability.
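For illustration, the state machine described in the comment above might be modeled roughly like this. All names are hypothetical, and the ZK and NN interactions are reduced to boolean outcomes; a real implementation would perform the lock acquisition, fencing, and NN RPCs itself:

```java
// Illustrative sketch of the ZKFC state machine from the design sketch.
// ZK and NN side effects are stubbed out as boolean parameters.
enum ZkfcState { LOCAL_NOT_READY, LOCAL_STANDBY, LOCAL_ACTIVE }

class ZkfcStateMachine {
    private ZkfcState state = ZkfcState.LOCAL_NOT_READY; // system starts here

    ZkfcState getState() { return state; }

    // HealthMonitor reported the local NN healthy.
    void onNnHealthy() {
        if (state == ZkfcState.LOCAL_NOT_READY) {
            state = ZkfcState.LOCAL_STANDBY;
        }
    }

    // HealthMonitor lost contact with the NN, or it became unhealthy.
    // From LOCAL_ACTIVE we would also delete our activeLock znode;
    // another node will fence us.
    void onNnUnhealthy() {
        state = ZkfcState.LOCAL_NOT_READY;
    }

    // The ZK watcher saw the old active's ephemeral activeLock disappear,
    // so a standby attempts automatic failover. lockAcquired models the
    // attempt to create the activeLock znode (fencing via activeNodeInfo
    // would happen here too); nnWentActive models the transitionToActive RPC.
    void tryAutomaticFailover(boolean lockAcquired, boolean nnWentActive) {
        if (state == ZkfcState.LOCAL_STANDBY && lockAcquired && nnWentActive) {
            state = ZkfcState.LOCAL_ACTIVE;
        }
        // Otherwise remain in LOCAL_STANDBY, per the failover process above.
    }

    // Administrative graceful failover: ask the local NN to go to standby,
    // then (in a real implementation) delete our activeLock znode with a
    // timeout so another failover controller can take the lock.
    void onAdminFailoverRequest(boolean nnWentStandby) {
        if (state == ZkfcState.LOCAL_ACTIVE && nnWentStandby) {
            state = ZkfcState.LOCAL_STANDBY;
        }
    }
}
```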