[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13439573#comment-13439573 ] Hadoop QA commented on HDFS-3561: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12541971/HDFS-3561-3.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3066//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3066//console This message is automatically generated. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover, ha Affects Versions: 2.1.0-alpha, 3.0.0 Reporter: suja s Assignee: Vinay Priority: Critical Attachments: HDFS-3561-2.patch, HDFS-3561-3.patch, HDFS-3561.patch Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13439706#comment-13439706 ] Aaron T. Myers commented on HDFS-3561: -- +1, the latest patch looks good to me. Vinay, can you comment on what testing you did of this patch? Were you able to verify manually that the ZKFC now doesn't retry 45 times during a failover? I'll commit this patch as soon as this question is answered. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover, ha Affects Versions: 2.1.0-alpha, 3.0.0 Reporter: suja s Assignee: Vinay Priority: Critical Attachments: HDFS-3561-2.patch, HDFS-3561-3.patch, HDFS-3561.patch Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13439750#comment-13439750 ] Vinay commented on HDFS-3561: - Yes Aaron, We tested the described scenario after setting number of retries to 1. ZKFC retries only one time and gets SocketTimeOutException and throws back the exception to caller without retrying further. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover, ha Affects Versions: 2.1.0-alpha, 3.0.0 Reporter: suja s Assignee: Vinay Priority: Critical Attachments: HDFS-3561-2.patch, HDFS-3561-3.patch, HDFS-3561.patch Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13439755#comment-13439755 ] Aaron T. Myers commented on HDFS-3561: -- Great! Thanks for doing that. I'm going to commit this momentarily. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover, ha Affects Versions: 2.1.0-alpha, 3.0.0 Reporter: suja s Assignee: Vinay Priority: Critical Attachments: HDFS-3561-2.patch, HDFS-3561-3.patch, HDFS-3561.patch Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13439100#comment-13439100 ] Aaron T. Myers commented on HDFS-3561: -- Sounds good, Vinay. I'll be happy to review/commit the patch once you make these changes. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover, ha Affects Versions: 2.1.0-alpha, 3.0.0 Reporter: suja s Assignee: Vinay Priority: Critical Attachments: HDFS-3561-2.patch, HDFS-3561.patch Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13438463#comment-13438463 ] Vinay commented on HDFS-3561: - Thanks Aaron, I agree with your preference. I will post a new patch for that, which also contains below fixes. {quote}There's no need for the new getGracefulFenceConnectRetries function{quote} Ok, I will remove it. {quote}There's no need for these lines two be on separate lines{quote} This was created by eclipse formatter. Anyway, i will change it. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover, ha Affects Versions: 2.1.0-alpha, 3.0.0 Reporter: suja s Assignee: Vinay Priority: Critical Attachments: HDFS-3561-2.patch, HDFS-3561.patch Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13437160#comment-13437160 ] Aaron T. Myers commented on HDFS-3561: -- That's a good point, Vinay, that the method will only ever be called once, but I still think that creating a copy of the conf object in the FailoverController constructor makes the code a little clearer. I don't think the increased memory usage from having one extra copy of the conf object will be an issue at all. It will also be good from a future-proofing perspective to make sure that any mutations to the passed-in Configuration object don't affect the behavior of a long-lived FailoverController object. Does that make sense? Note that I don't feel super strongly about this; it's just my preference. If you disagree, we can go with what you have here. Two little nits I noticed while taking another look at this patch: # There's no need for the new getGracefulFenceConnectRetries function, since it's only ever called from the constructor of this class. The other two similar methods are necessary because they're called from the ZKFailoverController class. At the very least, the function should be made private. # There's no need for these lines two be on separate lines: {code} +newConf +.setInt( {code} ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover, ha Affects Versions: 2.1.0-alpha, 3.0.0 Reporter: suja s Assignee: Vinay Priority: Critical Attachments: HDFS-3561-2.patch, HDFS-3561.patch Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13432946#comment-13432946 ] Vinay commented on HDFS-3561: - Hi [~atm] any more comments you have on this..? ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover, ha Affects Versions: 2.1.0-alpha, 3.0.0 Reporter: suja s Assignee: Vinay Attachments: HDFS-3561-2.patch, HDFS-3561.patch Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13414316#comment-13414316 ] Vinay commented on HDFS-3561: - That sounds good. But as of now, in ZKFC, tryGracefulFence() is called by creating the new instance of FC itself. That means setting in Constructor or inside tryGracefulFence() both are same. If we create the copy of conf only for tryGracefulFence() in constructor, FC will hold the copied conf in memory for the local target also, where we never use tryGracefulFence(). i.e. we will hold the duplicate conf instance which is not necessary. Anything I am missing..? ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Affects Versions: 2.0.1-alpha, 3.0.0 Reporter: suja s Assignee: Vinay Attachments: HDFS-3561-2.patch, HDFS-3561.patch Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13413654#comment-13413654 ] Hadoop QA commented on HDFS-3561: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12536365/HDFS-3561-2.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 javadoc. The javadoc tool did not generate any warning messages. +1 eclipse:eclipse. The patch built with eclipse:eclipse. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed unit tests in hadoop-common-project/hadoop-common. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/2815//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2815//console This message is automatically generated. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Affects Versions: 2.0.1-alpha, 3.0.0 Reporter: suja s Assignee: Vinay Attachments: HDFS-3561-2.patch, HDFS-3561.patch Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13414093#comment-13414093 ] Aaron T. Myers commented on HDFS-3561: -- Instead of creating a new Configuration object every time tryGracefulFence is called, how about creating a copy of the conf object in the FailoverController constructor, and setting the IPC_CLIENT_CONNECT_MAX_RETRIES there? This would also mean you wouldn't have to have the extra instance variable or method introduced by this patch. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Affects Versions: 2.0.1-alpha, 3.0.0 Reporter: suja s Assignee: Vinay Attachments: HDFS-3561-2.patch, HDFS-3561.patch Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13410218#comment-13410218 ] Vinay commented on HDFS-3561: - Thanks Aaron for the suggestion. I have one question here. Shall we set the Connection retries to 1, for all proxy connections for FailOverController, i.e. local target as well as remote target. Any way connecting to Local target should not take much time. What is your suggestion? ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Affects Versions: 2.0.1-alpha, 3.0.0 Reporter: suja s Assignee: Vinay Attachments: HDFS-3561.patch Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13410903#comment-13410903 ] Aaron T. Myers commented on HDFS-3561: -- I'd think that we'd only want the lower number of retries when trying to connect to the node that may be down. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Affects Versions: 2.0.1-alpha, 3.0.0 Reporter: suja s Assignee: Vinay Attachments: HDFS-3561.patch Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409927#comment-13409927 ] Aaron T. Myers commented on HDFS-3561: -- Seems to me like these new configs should not be made specific to the ZKFC, but rather should apply to all failover controllers. Given that, I think we should change the config keys to be named similarly to the other FC graceful connection configs, e.g. ha.failover-controller.graceful-fence.rpc-timeout.ms. Furthermore, we should push down the handling for this into the FailoverController, and not put it in ZKFailoverController. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Affects Versions: 2.0.1-alpha, 3.0.0 Reporter: suja s Assignee: Vinay Attachments: HDFS-3561.patch Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13408255#comment-13408255 ] Uma Maheswara Rao G commented on HDFS-3561: --- How about the configuration key name like this HA_ZKFC_GRACEFUL_FENCE_MAX_RETRIES -- HA_ZKFC_GRACEFUL_FENCE_CONNECTION_RETRIES_KEY ? Also will it be more clearer to separate out the configurations( retries on connection and socket timeout retries)? because they are different configs exposed in common. @Aaron, could you please add your comments here? ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Affects Versions: 2.0.1-alpha, 3.0.0 Reporter: suja s Assignee: Vinay Attachments: HDFS-3561.patch Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13408385#comment-13408385 ] Hadoop QA commented on HDFS-3561: - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12535395/HDFS-3561.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2750//console This message is automatically generated. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Affects Versions: 2.0.1-alpha, 3.0.0 Reporter: suja s Assignee: Vinay Attachments: HDFS-3561.patch Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13405151#comment-13405151 ] Aaron T. Myers commented on HDFS-3561: -- bq. How we can do shared storage fencing from ZKFC? Many NFS filers have APIs for fencing all subsequent writes or reads from a given IP address. The ZKFC can be configured to run a script which initiates fencing via this method. See the dfs.ha.fencing.methods section on [this page|http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailability.html#Configuration_details]. bq. for example if we have shared storage fencing at writer level. Whenever new writer comes, automatically old writer will not be allowed. I'm not sure what you mean by this. {quote} Still my question remains. If we want to fence 100% from ZKFC itself, why shared storage writer level fencing required? How you are handling network down scenario in your clusters? {quote} The ZKFC is always responsible for triggering fencing, so it's not really that we fence 100% from ZKFC itself. Rather, the ZKFC is capable of triggering the fencing process for a given NN via other devices, e.g. the filer or PDU. I hope this clears things up. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Reporter: suja s Assignee: Vinay Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13405222#comment-13405222 ] Uma Maheswara Rao G commented on HDFS-3561: --- Hi Aaron, Thanks a lot. {code} for example if we have shared storage fencing at writer level. Whenever new writer comes, automatically old writer will not be allowed. I'm not sure what you mean by this. {code} What I mean by this is, take a case of BookKeeper: BookKeeper itself has a fencing logic. Whenever we open a new ledger, automatically it will get fenced. Other clients will not be able to proceed for further writes. I hope similar logic may come in QJM also using some writer level epoch numbers. May be we need to develop a fence method with this clients and fence the things, but only the thing i don't like is I need to to place that client libraries in ZKFC(this step may not be required as NN's respective JM's will any way has to open writer, that writer itself will fence automatically). I hope u understand my point here. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Reporter: suja s Assignee: Vinay Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13405227#comment-13405227 ] Aaron T. Myers commented on HDFS-3561: -- Ah, yes. Both in the case of BKJM or the QJM, fencing is effectively built into write protocol, so there won't be any need for external fencing options. Regardless, note that I don't object to the main point of this JIRA, i.e. that we should lower the number of retries when attempting to gracefully fence an NN. If someone wants to supply a patch for that, I'll be happy to review/commit it. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Reporter: suja s Assignee: Vinay Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13405238#comment-13405238 ] Uma Maheswara Rao G commented on HDFS-3561: --- Thanks Aaron :-) {quote} Regardless, note that I don't object to the main point of this JIRA, i.e. that we should lower the number of retries when attempting to gracefully fence an NN. If someone wants to supply a patch for that, I'll be happy to review/commit it. {quote} Sure. I will ask Vinay to upload the patch what we have. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Reporter: suja s Assignee: Vinay Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13404285#comment-13404285 ] Aaron T. Myers commented on HDFS-3561: -- I think some wires are getting crossed here. Some clarifications: * The ZKFC *always* performs the act of fencing, by executing the configured fencing methods. * There are two fencing methods shipped out of the box: 1) RPC to the active NN to tell it to move to the standby state, 2) ssh to the active NN and `kill -9' the NN process. * You can optionally configure more fencing methods, for example IP-based shared storage fencing, or IP-based STONITH via PDU fencing. * The ZKFC proceeds to execute the various fencing methods in the order they're configured. * One of the stated aims of the HA work was to favor data reliability over availability. So, if we can't guarantee the correctness of the data, we shouldn't cause a state transition. Given all of the above, at least one of the fencing methods *must* succeed before the ZFKC can reasonably cause the standby to transition to active. Imagine a network failure wherein the ZKFCs can no longer reach the active NN, but can reach the standby. If we just try to ping the active NN for a while, without having successfully fenced it, and then transition the standby to active since we can't ping the previous active, then both NNs might be active simultaneously, write to the shared storage and corrupt the FS metadata. This isn't acceptable. As I said previously, I'm very much in favor of lowering the number of graceful fencing retries to a reasonable value. Todd recommended 0 or 1, which sounds fine by me. What I'm not in favor of is changing the ZKFC to ever cause the standby to become active without *some* fencing method succeeding. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Reporter: suja s Assignee: Vinay Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13404354#comment-13404354 ] Uma Maheswara Rao G commented on HDFS-3561: --- Hi Aaron, Thanks a lot for the explanation. I have few questions. {quote} You can optionally configure more fencing methods, for example IP-based shared storage fencing {quote} ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Reporter: suja s Assignee: Vinay Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13404358#comment-13404358 ] Uma Maheswara Rao G commented on HDFS-3561: --- Hi Aaron, Thanks a lot for the explanation. I have few questions. I think I am missing some thing here. {quote} You can optionally configure more fencing methods, for example IP-based shared storage fencing {quote} How we can do shared storage fencing from ZKFC? for example if we have shared storage fencing at writer level. Whenever new writer comes, automatically old writer will not be allowed. {quote} One of the stated aims of the HA work was to favor data reliability over availability. So, if we can't guarantee the correctness of the data, we shouldn't cause a state transition. {quote} Still my question remains. If we want to fence 100% from ZKFC itself, why shared storage writer level fencing required? How you are handling network down scenario in your clusters? Regards, Uma ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Reporter: suja s Assignee: Vinay Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13402978#comment-13402978 ] Vinay commented on HDFS-3561: - {quote}This isn't acceptable. The point of fencing is to ensure that if the previously-active NN returns from appearing to have been down, it doesn't start writing to the shared directory again while the new active is also writing to that directory.{quote} If the shared storage provides good fencing, so that if it is fenced once and not allowing to write anymore from old writer then this situation will never come. If the fail-over not happening due to fencing failure of old active that means network down is not supported in ZKFC..? I think, taking transitioning current standby to Active in the current situation will be the correct behavior. Any thoughts on this..? ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Reporter: suja s Assignee: Vinay Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13402994#comment-13402994 ] Uma Maheswara Rao G commented on HDFS-3561: --- Yes, we have multiple level of fencings. First fencing should happen at ZKFC level. Other place, will have Fencing at Shared storage level. What if fencing logic check for IP reachability? If ip is reachable and not able to fence means fence failed, If IP itself is not reachable then, no way it can kill the remote node. At this case we may need to give chance to next level fencer (shared storage)? If ZKFC itself ensure 100% fence then, why shared storage fencing is required? Am i missing something here? ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Reporter: suja s Assignee: Vinay Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13400487#comment-13400487 ] Vinay commented on HDFS-3561: - During transition, fencing of old active will be done. Here before actually using the fencing method configured, gracefull fencing will be tried. Now zkfc will try to get the proxy of other machine Namenode. since the n/w is down, it is not able to get the connection and it is retrying for 45 times configured using *ipc.client.connect.max.retries.on.timeouts* {code}LOG.info(Should fence: + target); boolean gracefulWorked = new FailoverController(conf, RequestSource.REQUEST_BY_ZKFC).tryGracefulFence(target); if (gracefulWorked) { // It's possible that it's in standby but just about to go into active, // no? Is there some race here? LOG.info(Successfully transitioned + target + to standby + state without fencing); return; }{code} I think in ZKFC case we can reduce the number of retries. Any thoughts? ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Reporter: suja s Assignee: Vinay Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13400504#comment-13400504 ] Uma Maheswara Rao G commented on HDFS-3561: --- I think we can set retries to 1/2 for avoiding unnecessary actions on small nw fluctuations? or we can set it to 0 as we are already setting the same values in ConfiguredFailoverProxyProvider for failover clients. {code} public static final String DFS_CLIENT_FAILOVER_CONNECTION_RETRIES_KEY = dfs.client.failover.connection.retries; public static final int DFS_CLIENT_FAILOVER_CONNECTION_RETRIES_DEFAULT = 0; public static final String DFS_CLIENT_FAILOVER_CONNECTION_RETRIES_ON_SOCKET_TIMEOUTS_KEY = dfs.client.failover.connection.retries.on.timeouts; public static final int DFS_CLIENT_FAILOVER_CONNECTION_RETRIES_ON_SOCKET_TIMEOUTS_DEFAULT = 0; {code} ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Reporter: suja s Assignee: Vinay Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13400833#comment-13400833 ] Aaron T. Myers commented on HDFS-3561: -- bq. Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network This isn't acceptable. The point of fencing is to ensure that if the previously-active NN returns from appearing to have been down, it doesn't start writing to the shared directory again while the new active is also writing to that directory. bq. I think we can set retries to 1/2 for avoiding unnecessary actions on small nw fluctuations? or we can set it to 0 as we are already setting the same values in ConfiguredFailoverProxyProvider for failover clients. We set it to 0 in ConfiguredFailoverProxyProvider because we want to trying failing over immediately as the retry mechanism, instead of repeatedly trying to contact a machine that may in fact be completely down. I agree, though, that setting it to a lower number than 45 makes sense in the case of the client in the ZKFC, and perhaps making it configurable separately. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Reporter: suja s Assignee: Vinay Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active
[ https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13400847#comment-13400847 ] Todd Lipcon commented on HDFS-3561: --- +1 for setting it to 0 or 1 for the graceful fence attempt. ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active Key: HDFS-3561 URL: https://issues.apache.org/jira/browse/HDFS-3561 Project: Hadoop HDFS Issue Type: Bug Components: auto-failover Reporter: suja s Assignee: Vinay Scenario: Active NN on machine1 Standby NN on machine2 Machine1 is isolated from the network (machine1 network cable unplugged) After zk session timeout ZKFC at machine2 side gets notification that NN1 is not there. ZKFC tries to failover NN2 as active. As part of this during fencing it tries to connect to machine1 and kill NN1. (sshfence technique configured) This connection retry happens for 45 times( as it takes ipc.client.connect.max.socket.retries) Also after that standby NN is not able to take over as active (because of fencing failure). Suggestion: If ZKFC is not able to reach other NN for specified time/no of retries it can consider that NN as dead and instruct the other NN to take over as active as there is no chance of the other NN (NN1) retaining its state as active after zk session timeout when its isolated from network From ZKFC log: {noformat} 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s). 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s). 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s). 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s). 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s). 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s). 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s). 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s). 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s). 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s). {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira