[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinay updated HDFS-3561: ------------------------ Affects Version/s: 3.0.0 2.0.1-alpha Status: Patch Available (was: Open) > ZKFC retries for 45 times to connect to other NN during fencing when network > between NNs broken and standby Nn will not take over as active > -------------------------------------------------------------------------------------------------------------------------------------------- > > Key: HDFS-3561 > URL: https://issues.apache.org/jira/browse/HDFS-3561 > Project: Hadoop HDFS > Issue Type: Bug > Components: auto-failover > Affects Versions: 2.0.1-alpha, 3.0.0 > Reporter: suja s > Assignee: Vinay > Attachments: HDFS-3561.patch > > > Scenario: > Active NN on machine1 > Standby NN on machine2 > Machine1 is isolated from the network (machine1 network cable unplugged) > After zk session timeout ZKFC at machine2 side gets notification that NN1 is > not there. > ZKFC tries to failover NN2 as active. > As part of this during fencing it tries to connect to machine1 and kill NN1. > (sshfence technique configured) > This connection retry happens for 45 times( as it takes > ipc.client.connect.max.socket.retries) > Also after that standby NN is not able to take over as active (because of > fencing failure). > Suggestion: If ZKFC is not able to reach other NN for specified time/no of > retries it can consider that NN as dead and instruct the other NN to take > over as active as there is no chance of the other NN (NN1) retaining its > state as active after zk session timeout when its isolated from network > From ZKFC log: > {noformat} > 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). > 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). > 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). > 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). > 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). > 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). > 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). > 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). > 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). > 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect > to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). > {noformat} > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira