[ https://issues.apache.org/jira/browse/HDFS-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246729#comment-13246729 ]

Todd Lipcon commented on HDFS-2185:
-----------------------------------

Hi Mingjie. Thanks for taking a look.

The idea for the chain of RPCs is from talking with some folks here who work on 
Hadoop deployment. Their opinion was the following: currently, most of the 
Hadoop client tools are too "thick". For example, in the current manual 
failover implementation, the fencing is run on the admin client. This means 
that you have to run the haadmin command from a machine that has access to all 
of the necessary fencing scripts, key files, etc. That's a little bizarre -- 
you would expect to configure these kinds of things only in a central 
location, not on the client.

So, we decided that it makes sense to push the management of the whole failover 
process into the FCs themselves, and have the client issue a single RPC to kick 
it off. This keeps the client "thin".
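To make that concrete, the client-facing surface could be as small as the 
following (a minimal sketch; the interface and method names are illustrative, 
not a committed API):

{code:java}
import java.io.IOException;

/**
 * Sketch of the single RPC a thin admin client would issue. Names are
 * illustrative, not a committed API.
 */
public interface ZKFCProtocol {
  /**
   * Coordinate a graceful failover so that this ZKFC's local NameNode
   * becomes active. All fencing and election steps run inside the
   * ZKFCs, not in the admin client.
   */
  void gracefulFailover() throws IOException;

  /**
   * Drop out of the election for the given period so that another node
   * can take the active lock. If none does, this node rejoins the
   * election and may regain the active state.
   */
  void cedeActive(int millisToCede) throws IOException;
}
{code}

With something like this, haadmin needs neither fencing configuration nor ZK 
credentials; it only needs to reach the target node's ZKFC.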

As for your proposed alternative, here are a few thoughts:

bq. existing manual fo code can be kept mostly
We actually share much of the code already. But the problem with using the 
existing code exactly as-is is that the failover controllers always expect to 
have complete "control" over the system. If the state of the NNs changes 
underneath the ZKFC, then the state in ZK becomes inconsistent with the 
actual state of the system, and it's very easy to get into split-brain 
scenarios. So the idea is that, when auto-failover is enabled, *all* decisions 
must be made by the ZKFCs. That way we can make sure the ZK state doesn't get 
out of sync.

bq. although new RPC is added to ZKFC but we don't need them to talk to each 
other. the manual failover logic is all handled at client – haadmin.
As noted above, I think this is a con, not a pro, because it requires 
configuring fencing scripts on the client, and likely requires that the client 
have read-write access to ZK.

bq. easier to extend to the case of multiple standby NNs

I think the extension path to multiple standbys is actually equally easy with 
both approaches. The solution in the ZKFC-managed implementation is to add a 
new znode like "PreferredActive" and have nodes avoid becoming active unless 
they're listed as preferred. The target node of the failover can just set 
itself as preferred before asking the other node to cede the lock, as sketched 
below.
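A rough sketch of that check, assuming a hypothetical znode path and helper 
names (nothing here is part of the current design):

{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

/**
 * Hypothetical sketch of the "PreferredActive" extension for multiple
 * standbys. The znode path and method names are illustrative only.
 */
class PreferredActiveSketch {
  static final String PREFERRED_ACTIVE =
      "/hadoop-ha/mycluster/PreferredActive"; // hypothetical path

  /** Checked by the elector before transitioning a node to active. */
  static boolean mayBecomeActive(ZooKeeper zk, String myId) throws Exception {
    try {
      byte[] data = zk.getData(PREFERRED_ACTIVE, false, null);
      return myId.equals(new String(data, StandardCharsets.UTF_8));
    } catch (KeeperException.NoNodeException e) {
      return true; // no preference recorded: any node may become active
    }
  }

  /** Set by the failover target before asking the active to cede the lock. */
  static void setPreferredActive(ZooKeeper zk, String myId) throws Exception {
    byte[] data = myId.getBytes(StandardCharsets.UTF_8);
    try {
      zk.setData(PREFERRED_ACTIVE, data, -1);
    } catch (KeeperException.NoNodeException e) {
      zk.create(PREFERRED_ACTIVE, data,
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
  }
}
{code}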


Some other advantages that I probably didn't explain well in the design doc:
- this design is fault tolerant. If the "target" node crashes in the middle of 
the process, then the old active will automatically regain the active state 
after its "rejoin" timeout elapses (see the sketch after this list). With a 
client-managed setup, a well-meaning admin may ^C the process in the middle and 
leave the system with no active at all.
- no need to introduce "disable/enable" for auto-failover. Just having both 
nodes quit the election wouldn't work, since one would end up quitting "before" 
the other, causing a blip where an unnecessary (random) failover occurred. We 
could carefully orchestrate the order of quitting, so that the active quits 
last, but I think it still gets complicated.
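For concreteness, here is a rough sketch of the target-side sequence under 
those semantics, building on the ZKFCProtocol sketch above. The helper methods 
and the timeout value are hypothetical, not part of any committed code:

{code:java}
import java.io.IOException;

/**
 * Sketch of the ZKFC-coordinated failover as seen from the target
 * node. All names below are illustrative; the abstract helpers are
 * hypothetical stubs standing in for real elector/RPC plumbing.
 */
abstract class GracefulFailoverSketch {
  int rejoinTimeoutMs = 60_000; // hypothetical "rejoin" timeout

  abstract ZKFCProtocol connectToZkfc(String address) throws IOException;
  abstract String getCurrentActiveAddress() throws IOException;
  abstract void waitForLocalNodeToBecomeActive() throws IOException;

  void coordinateFailoverToLocalNode() throws IOException {
    // 1. Ask the current active's ZKFC to drop out of the election for
    //    a bounded period. Fencing and the active-to-standby transition
    //    happen inside that ZKFC, not in the admin client.
    ZKFCProtocol oldActive = connectToZkfc(getCurrentActiveAddress());
    oldActive.cedeActive(rejoinTimeoutMs);

    // 2. The local elector now wins the lock and makes the local NN
    //    active.
    waitForLocalNodeToBecomeActive();

    // 3. Crash safety: if this node dies before step 2 completes, the
    //    old active's ZKFC rejoins the election once rejoinTimeoutMs
    //    elapses and regains the active state, so the cluster is never
    //    left with no active at all.
  }
}
{code}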
                
> HA: HDFS portion of ZK-based FailoverController
> -----------------------------------------------
>
>                 Key: HDFS-2185
>                 URL: https://issues.apache.org/jira/browse/HDFS-2185
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: auto-failover, ha
>    Affects Versions: 0.24.0, 0.23.3
>            Reporter: Eli Collins
>            Assignee: Todd Lipcon
>             Fix For: Auto failover (HDFS-3042)
>
>         Attachments: Failover_Controller.jpg, hdfs-2185.txt, hdfs-2185.txt, 
> hdfs-2185.txt, hdfs-2185.txt, hdfs-2185.txt, zkfc-design.pdf, 
> zkfc-design.pdf, zkfc-design.pdf, zkfc-design.pdf, zkfc-design.tex
>
>
> This jira is for a ZK-based FailoverController daemon. The FailoverController 
> is a separate daemon from the NN that does the following:
> * Initiates leader election (via ZK) when necessary
> * Performs health monitoring (aka failure detection)
> * Performs fail-over (standby to active and active to standby transitions)
> * Heartbeats to ensure liveness
> It should have the same/similar interface as the Linux HA RM to aid 
> pluggability.
