[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439100#comment-13439100
 ] 

Aaron T. Myers commented on HDFS-3561:
--------------------------------------

Sounds good, Vinay. I'll be happy to review/commit the patch once you make 
these changes.
                
> ZKFC retries 45 times to connect to other NN during fencing when network 
> between NNs is broken, and standby NN will not take over as active 
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3561
>                 URL: https://issues.apache.org/jira/browse/HDFS-3561
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover, ha
>    Affects Versions: 2.1.0-alpha, 3.0.0
>            Reporter: suja s
>            Assignee: Vinay
>            Priority: Critical
>         Attachments: HDFS-3561-2.patch, HDFS-3561.patch
>
>
> Scenario:
> Active NN on machine1
> Standby NN on machine2
> Machine1 is isolated from the network (machine1 network cable unplugged)
> After the ZK session timeout, the ZKFC on machine2 is notified that NN1 is 
> gone.
> The ZKFC tries to fail over and make NN2 active.
> As part of this, during fencing it tries to connect to machine1 and kill NN1 
> (sshfence technique configured).
> This connection is retried 45 times (the value it takes from 
> ipc.client.connect.max.socket.retries).
> Even after that, the standby NN is not able to take over as active (because 
> of the fencing failure).
> Suggestion: if the ZKFC cannot reach the other NN within a specified time or 
> number of retries, it can consider that NN dead and instruct the standby NN 
> (NN2) to take over as active, since there is no chance of the other NN (NN1) 
> retaining its active state after the ZK session timeout when it is isolated 
> from the network.
> From ZKFC log:
> {noformat}
> 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s).
> 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s).
> 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s).
> 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s).
> 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s).
> 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s).
> 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s).
> 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s).
> 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s).
> 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s).
> {noformat}
>  
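
For a rough sense of how long fencing stalls in this scenario, the timestamps
in the log excerpt above can be used to estimate the total retry window. The
following is a minimal Python sketch, not part of the issue or the patch. It
assumes the ten quoted log lines are representative of all attempts and that
the retry count is the 45 reported in the description. The function names are
hypothetical.

```python
from datetime import datetime

# Timestamps copied from the ZKFC log excerpt above (attempts 22 through 31).
LOG_TIMESTAMPS = [
    "2012-06-21 17:46:14,378",
    "2012-06-21 17:46:35,378",
    "2012-06-21 17:46:56,378",
    "2012-06-21 17:47:17,378",
    "2012-06-21 17:47:38,382",
    "2012-06-21 17:47:59,382",
    "2012-06-21 17:48:20,386",
    "2012-06-21 17:48:41,386",
    "2012-06-21 17:49:02,386",
    "2012-06-21 17:49:23,386",
]

# Retry count reported in the issue description (assumed to apply unchanged).
MAX_RETRIES = 45


def parse(ts):
    """Parse a log4j-style timestamp such as '2012-06-21 17:46:14,378'."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")


def mean_interval_seconds(timestamps):
    """Average gap, in seconds, between consecutive retry log lines."""
    times = [parse(t) for t in timestamps]
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


def projected_total_seconds(timestamps, retries):
    """Estimated time spent retrying, assuming a constant interval."""
    return mean_interval_seconds(timestamps) * retries


if __name__ == "__main__":
    interval = mean_interval_seconds(LOG_TIMESTAMPS)
    total = projected_total_seconds(LOG_TIMESTAMPS, MAX_RETRIES)
    print(f"mean interval: {interval:.1f} s")
    print(f"projected retry window: {total / 60:.1f} min")
```

Under these assumptions the retries are about 21 seconds apart, so 45 attempts
keep the ZKFC waiting for roughly 15 to 16 minutes before the connection
attempt gives up.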

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
