[jira] [Commented] (HDFS-3192) Active NN should exit when it has not received a getServiceStatus() rpc from ZKFC for timeout secs

Hari Mankude (Commented) (JIRA) Wed, 04 Apr 2012 16:44:47 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-3192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246852#comment-13246852
 ]


Hari Mankude commented on HDFS-3192:
------------------------------------

bq.Can you explain why it has to restart, instead of just transitioning to 
standby? What do you mean by "in limbo" here?

"In limbo" means that NN1 still thinks it is active even though NN2 has taken 
over, because NN1 has not yet tried to access the edit logs. So it is not 
behaving as a standby and keeping up with the active. Are you suggesting that 
ZKFC1 calls transitionToStandby() when it loses the znode? On an active NN, 
there is a high probability that it might abort. Also, does 
transitionToStandby() guarantee that all the active-state threads have quiesced? 

bq.Before issuing an "uncontrolled abort", the ZKFC2 will always try to do a 
"graceful fence" – ie ask it to self-resign via an RPC. See the 
tryGracefulFence function in the FailoverController class.

I don't think that calling tryGracefulFence() from NN2 to NN1 is safe. First of 
all, it opens up one more channel of communication between NN1 and NN2, and 
that channel is subject to various race sequences, split-brain, etc. I think 
self-resign is much safer than tryGracefulFence(). So far, I don't see an 
argument in our discussion that self-resign lacks correctness. Is my 
description correct here?
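
To make the self-resign idea from the issue title concrete, here is a minimal 
sketch of a watchdog that fires once no health-check RPC has arrived within the 
timeout. Everything in it is an assumption for illustration: the class 
HealthCheckWatchdog, its method names, the injected clock and the onTimeout 
callback are hypothetical and not part of Hadoop. A real NN would presumably 
call recordHealthCheck() from its getServiceStatus() handler and pass a 
callback that terminates the process.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical watchdog, not Hadoop code. It remembers when the last
 * health-check RPC arrived and runs a callback once the configured
 * timeout has passed without a new one.
 */
class HealthCheckWatchdog {
    private final long timeoutMillis;
    private final LongSupplier clock;   // injected so the logic is testable
    private final Runnable onTimeout;   // assumed to stop the active NN
    private final AtomicLong lastCheckMillis;

    HealthCheckWatchdog(long timeoutMillis, LongSupplier clock, Runnable onTimeout) {
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
        this.onTimeout = onTimeout;
        // Treat construction time as the first "check" so a freshly
        // activated NN gets a full timeout before it can resign.
        this.lastCheckMillis = new AtomicLong(clock.getAsLong());
    }

    /** Call each time the ZKFC's status RPC is received. */
    void recordHealthCheck() {
        lastCheckMillis.set(clock.getAsLong());
    }

    /** Call periodically while active; returns true if the timeout fired. */
    boolean checkOnce() {
        long silentFor = clock.getAsLong() - lastCheckMillis.get();
        if (silentFor > timeoutMillis) {
            onTimeout.run();
            return true;
        }
        return false;
    }
}
```

The point of the design is that the decision to resign is made entirely inside 
NN1 from local state, so it needs no extra channel between NN1 and NN2. It 
does assume a monotonic clock on NN1 and a timeout shorter than the time NN2 
waits before taking over.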



                
> Active NN should exit when it has not received a getServiceStatus() rpc from 
> ZKFC for timeout secs
> --------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3192
>                 URL: https://issues.apache.org/jira/browse/HDFS-3192
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha, name-node
>            Reporter: Hari Mankude
>            Assignee: Hari Mankude
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
