[ https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246852#comment-13246852 ]
Hari Mankude commented on HDFS-3192:
------------------------------------

bq. Can you explain why it has to restart, instead of just transitioning to standby? What do you mean by "in limbo" here?

"In limbo" implies that NN1 thinks that it is active even though NN2 has taken over, because NN1 has not tried to access the edit logs. So it is not behaving as a standby and keeping up with the active. Are you suggesting that ZKFC1 does transitionToStandby() when it loses the znode? On an active NN, there is a high probability that it might abort. Also, does transitionToStandby() guarantee that all the active-state threads have quiesced?

bq. Before issuing an "uncontrolled abort", the ZKFC2 will always try to do a "graceful fence", i.e. ask it to self-resign via an RPC. See the tryGracefulFence function in the FailoverController class.

I don't think that doing tryGracefulFence() from NN2 to NN1 is safe. First of all, this opens up one more channel of communication between NN1 and NN2, and that channel is subject to various race sequences, split-brain, etc. I think self-resign is much safer than tryGracefulFence().

So far, I don't see an argument in our discussion that this approach is incorrect. Is my description correct here?

> Active NN should exit when it has not received a getServiceStatus() rpc from
> ZKFC for timeout secs
> --------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3192
>                 URL: https://issues.apache.org/jira/browse/HDFS-3192
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha, name-node
>            Reporter: Hari Mankude
>            Assignee: Hari Mankude
>
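
The behaviour named in the issue title (the active NN exits when it has not received a getServiceStatus() RPC from the ZKFC within a timeout) can be sketched as a small watchdog. This is only an illustration of the idea, not code from Hadoop or from any patch on this issue: the class names SelfResignWatchdog and WatchdogDemo, the methods recordProbe() and check(), the injected clock, and the 5-second timeout are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical watchdog, not Hadoop code: runs a callback once when no
 * health probe has been recorded for at least timeoutMillis.
 */
final class SelfResignWatchdog {
    private final long timeoutMillis;
    private final LongSupplier clockMillis;  // injectable so the logic can be tested
    private final Runnable onExpire;         // assumed action, e.g. abort the process
    private final AtomicLong lastProbeMillis;
    private final AtomicBoolean fired = new AtomicBoolean(false);

    SelfResignWatchdog(long timeoutMillis, LongSupplier clockMillis, Runnable onExpire) {
        this.timeoutMillis = timeoutMillis;
        this.clockMillis = clockMillis;
        this.onExpire = onExpire;
        this.lastProbeMillis = new AtomicLong(clockMillis.getAsLong());
    }

    /** Would be called from the handler of the ZKFC's getServiceStatus() RPC. */
    void recordProbe() {
        lastProbeMillis.set(clockMillis.getAsLong());
    }

    /** Called periodically by a timer; returns true once the watchdog has fired. */
    boolean check() {
        long silentFor = clockMillis.getAsLong() - lastProbeMillis.get();
        if (silentFor >= timeoutMillis && fired.compareAndSet(false, true)) {
            onExpire.run();  // runs at most once
        }
        return fired.get();
    }
}

final class WatchdogDemo {
    public static void main(String[] args) {
        AtomicLong fakeNow = new AtomicLong(0);  // fake clock for a deterministic demo
        SelfResignWatchdog watchdog = new SelfResignWatchdog(
            5_000, fakeNow::get,
            () -> System.out.println("no probe within timeout: self-resign"));

        fakeNow.set(4_000);
        watchdog.recordProbe();                  // probe arrives at t = 4 s
        fakeNow.set(8_000);
        System.out.println(watchdog.check());    // 4 s of silence: prints false
        fakeNow.set(9_000);
        System.out.println(watchdog.check());    // 5 s of silence: callback runs, prints true
    }
}
```

In a real process the clock would more plausibly be a monotonic source such as System.nanoTime() converted to milliseconds, and the callback would stop the process rather than print. Whether the NN should abort or attempt transitionToStandby() at that point is exactly the question under discussion above, so the sketch leaves that choice to the caller.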