[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-08-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439573#comment-13439573
 ] 

Hadoop QA commented on HDFS-3561:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12541971/HDFS-3561-3.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/3066//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/3066//console

This message is automatically generated.

 ZKFC retries for 45 times to connect to other NN during fencing when network 
 between NNs broken and standby Nn will not take over as active 
 

 Key: HDFS-3561
 URL: https://issues.apache.org/jira/browse/HDFS-3561
 Project: Hadoop HDFS
 Issue Type: Bug
 Components: auto-failover, ha
 Affects Versions: 2.1.0-alpha, 3.0.0
 Reporter: suja s
 Assignee: Vinay
 Priority: Critical
 Attachments: HDFS-3561-2.patch, HDFS-3561-3.patch, HDFS-3561.patch


 Scenario:
 Active NN on machine1.
 Standby NN on machine2.
 Machine1 is isolated from the network (machine1's network cable is unplugged).
 After the ZK session timeout, the ZKFC on machine2 is notified that NN1 is 
 gone.
 The ZKFC tries to fail over and make NN2 active.
 As part of this, during fencing, it tries to connect to machine1 and kill NN1 
 (the sshfence technique is configured).
 This connection is retried 45 times (the count is taken from 
 ipc.client.connect.max.socket.retries).
 Even after that, the standby NN is not able to take over as active, because 
 fencing failed.
 Suggestion: If the ZKFC cannot reach the other NN within a specified time or 
 number of retries, it can consider that NN dead and instruct the standby NN 
 (NN2) to take over as active, since there is no chance of the other NN (NN1) 
 retaining its active state after the ZK session timeout when it is isolated 
 from the network.
 From the ZKFC log:
 {noformat}
 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s).
 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s).
 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s).
 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s).
 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s).
 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s).
 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s).
 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s).
 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s).
 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s).
 {noformat}
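
 To put a number on how long the failover is stalled, the timestamps in the 
 log excerpt above can be turned into a rough estimate. The sketch below is 
 only an illustration: RetryLogEstimate and its methods are hypothetical 
 helpers (not part of Hadoop), and it assumes every attempt costs the same 
 time as the gaps visible in the excerpt.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: estimates retry spacing from log timestamps.
class RetryLogEstimate {
    // Timestamp layout used by the log lines quoted in the issue.
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // The ten timestamps copied from the ZKFC log excerpt.
    static final List<String> LOG_STAMPS = Arrays.asList(
        "2012-06-21 17:46:14,378", "2012-06-21 17:46:35,378",
        "2012-06-21 17:46:56,378", "2012-06-21 17:47:17,378",
        "2012-06-21 17:47:38,382", "2012-06-21 17:47:59,382",
        "2012-06-21 17:48:20,386", "2012-06-21 17:48:41,386",
        "2012-06-21 17:49:02,386", "2012-06-21 17:49:23,386");

    // Mean gap in milliseconds between consecutive timestamps (integer division).
    static long meanGapMillis(List<String> stamps) {
        LocalDateTime first = LocalDateTime.parse(stamps.get(0), FMT);
        LocalDateTime last = LocalDateTime.parse(stamps.get(stamps.size() - 1), FMT);
        return Duration.between(first, last).toMillis() / (stamps.size() - 1);
    }

    // Rough total stall, assuming each of `attempts` tries costs one mean gap.
    static Duration estimateTotal(List<String> stamps, int attempts) {
        return Duration.ofMillis(meanGapMillis(stamps) * attempts);
    }

    public static void main(String[] args) {
        System.out.println("mean gap (ms): " + meanGapMillis(LOG_STAMPS));
        System.out.println("estimated stall for 45 attempts (s): "
            + estimateTotal(LOG_STAMPS, 45).getSeconds());
    }
}
```

 With the excerpt's timestamps this works out to about 21 seconds per retry, 
 so 45 attempts would take roughly a quarter of an hour before fencing even 
 reports failure, under the uniform-spacing assumption stated above.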
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-08-22 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439706#comment-13439706
 ] 

Aaron T. Myers commented on HDFS-3561:
--

+1, the latest patch looks good to me.

Vinay, can you comment on what testing you did of this patch? Were you able to 
verify manually that the ZKFC now doesn't retry 45 times during a failover? 
I'll commit this patch as soon as this question is answered.





[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-08-22 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439750#comment-13439750
 ] 

Vinay commented on HDFS-3561:
-

Yes, Aaron.
We tested the described scenario after setting the number of retries to 1.
The ZKFC retries only once, gets a SocketTimeoutException, and throws the 
exception back to the caller without retrying further.
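
The behaviour described in this comment (one retry, then the timeout is handed 
back to the caller) can be sketched as a bounded retry loop. This is a minimal 
illustration only, not the Hadoop ipc.Client code: BoundedConnect and 
connectWithRetries are hypothetical names, and the connection attempt is 
passed in as a plain Callable.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical sketch of a bounded connect-retry loop; not the Hadoop ipc.Client code.
class BoundedConnect {
    // Makes one initial attempt, then retries at most maxRetries times on a
    // socket timeout. Once the retries are used up, the last
    // SocketTimeoutException is rethrown so the caller can decide what to do.
    static <T> T connectWithRetries(Callable<T> connect, int maxRetries) throws Exception {
        int retries = 0;
        while (true) {
            try {
                return connect.call();
            } catch (SocketTimeoutException e) {
                if (retries >= maxRetries) {
                    throw e; // give up: no further retries
                }
                retries++;
            }
        }
    }

    public static void main(String[] args) {
        try {
            // Simulated unreachable peer: every attempt times out.
            connectWithRetries(() -> {
                throw new SocketTimeoutException("simulated timeout");
            }, 1);
        } catch (Exception e) {
            System.out.println("gave up after one retry: " + e.getMessage());
        }
    }
}
```

With maxRetries set to 1, an unreachable peer costs two connection attempts in 
this sketch (the initial one plus a single retry) instead of dozens.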






[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-08-22 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439755#comment-13439755
 ] 

Aaron T. Myers commented on HDFS-3561:
--

Great! Thanks for doing that.

I'm going to commit this momentarily.





[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-08-21 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439100#comment-13439100
 ] 

Aaron T. Myers commented on HDFS-3561:
--

Sounds good, Vinay. I'll be happy to review/commit the patch once you make 
these changes.





[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-08-20 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438463#comment-13438463
 ] 

Vinay commented on HDFS-3561:
-

Thanks, Aaron.

I agree with your preference. I will post a new patch for that, which will 
also contain the fixes below.
{quote}There's no need for the new getGracefulFenceConnectRetries 
function{quote}
OK, I will remove it.

{quote}There's no need for these lines to be on separate lines{quote}
This was produced by the Eclipse formatter. Anyway, I will change it.





[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-08-17 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437160#comment-13437160
 ] 

Aaron T. Myers commented on HDFS-3561:
--

That's a good point, Vinay, that the method will only ever be called once, but 
I still think that creating a copy of the conf object in the FailoverController 
constructor makes the code a little clearer. I don't think the increased memory 
usage from having one extra copy of the conf object will be an issue at all. It 
will also be good from a future-proofing perspective to make sure that any 
mutations to the passed-in Configuration object don't affect the behavior of a 
long-lived FailoverController object. Does that make sense? Note that I don't 
feel super strongly about this; it's just my preference. If you disagree, we 
can go with what you have here.

Two little nits I noticed while taking another look at this patch:

# There's no need for the new getGracefulFenceConnectRetries function, since 
it's only ever called from the constructor of this class. The other two similar 
methods are necessary because they're called from the ZKFailoverController 
class. At the very least, the function should be made private.
# There's no need for these lines to be on separate lines:
{code}
+newConf
+.setInt(
{code}
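
The defensive copy argued for in this comment can be sketched as follows. This 
is an assumption-laden illustration rather than the patch itself: 
FailoverControllerSketch is a hypothetical class, java.util.Properties stands 
in for Hadoop's Configuration, and both key names are made up for the example.

```java
import java.util.Properties;

// Hypothetical stand-in for the controller discussed above. Properties plays
// the role of Hadoop's Configuration, and both key names are invented.
class FailoverControllerSketch {
    static final String IPC_RETRIES_KEY = "example.ipc.connect.retries";
    static final String FENCE_RETRIES_KEY = "example.graceful-fence.connect.retries";

    // Private copy taken once at construction time.
    private final Properties gracefulFenceConf;

    FailoverControllerSketch(Properties conf) {
        // Copy the caller's settings so that later mutations of the passed-in
        // object cannot change the behaviour of this long-lived instance.
        Properties copy = new Properties();
        copy.putAll(conf);
        // Override the connect-retry count only in the copy (default of 1 here
        // is an assumption for the example).
        copy.setProperty(IPC_RETRIES_KEY, conf.getProperty(FENCE_RETRIES_KEY, "1"));
        this.gracefulFenceConf = copy;
    }

    // Retry count that a graceful-fence connection would use in this sketch.
    int fenceConnectRetries() {
        return Integer.parseInt(gracefulFenceConf.getProperty(IPC_RETRIES_KEY));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(IPC_RETRIES_KEY, "45");
        FailoverControllerSketch fc = new FailoverControllerSketch(conf);
        conf.setProperty(IPC_RETRIES_KEY, "99"); // later mutation by the caller
        System.out.println("fence retries: " + fc.fenceConnectRetries());
        System.out.println("caller's value: " + conf.getProperty(IPC_RETRIES_KEY));
    }
}
```

The cost is one extra copy of the settings per controller instance; the 
benefit is that the caller's object is never modified and the controller's 
retry setting stays fixed after construction.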





[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-08-13 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432946#comment-13432946
 ] 

Vinay commented on HDFS-3561:
-

Hi [~atm], do you have any more comments on this?





[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-07-14 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13414316#comment-13414316
 ] 

Vinay commented on HDFS-3561:
-

That sounds good. But as of now, in ZKFC, tryGracefulFence() is called on a 
newly created FC instance, so setting the value in the constructor or inside 
tryGracefulFence() amounts to the same thing.
If we create the copy of the conf in the constructor only for 
tryGracefulFence(), the FC will also hold the copied conf in memory for the 
local target, where we never call tryGracefulFence(), i.e. we would hold a 
duplicate conf instance that is not needed.

Am I missing anything?





[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-07-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13413654#comment-13413654
 ] 

Hadoop QA commented on HDFS-3561:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12536365/HDFS-3561-2.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 eclipse:eclipse.  The patch built with eclipse:eclipse.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in 
hadoop-common-project/hadoop-common.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-HDFS-Build/2815//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2815//console

This message is automatically generated.





[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-07-13 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13414093#comment-13414093
 ] 

Aaron T. Myers commented on HDFS-3561:
--

Instead of creating a new Configuration object every time tryGracefulFence is 
called, how about creating a copy of the conf object in the FailoverController 
constructor, and setting the IPC_CLIENT_CONNECT_MAX_RETRIES there? This would 
also mean you wouldn't have to have the extra instance variable or method 
introduced by this patch.
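
A minimal sketch of the suggestion above, assuming java.util.Properties as a stand-in for Hadoop's Configuration so the example runs without Hadoop on the classpath. The class name, the accessor names, and the key string are assumptions made for illustration; they are not taken from the patch.

```java
import java.util.Properties;

/** Hypothetical stand-in for the failover controller discussed in this thread. */
final class FailoverControllerSketch {
    // Assumed key string for IPC_CLIENT_CONNECT_MAX_RETRIES; check it against your Hadoop version.
    static final String IPC_CLIENT_CONNECT_MAX_RETRIES_KEY = "ipc.client.connect.max.retries";

    private final Properties conf;              // shared conf, left untouched
    private final Properties gracefulFenceConf; // private copy with the lowered retry count

    FailoverControllerSketch(Properties conf, int gracefulFenceRetries) {
        this.conf = conf;
        // Copy once in the constructor so the caller's conf is never mutated.
        this.gracefulFenceConf = new Properties();
        this.gracefulFenceConf.putAll(conf);
        this.gracefulFenceConf.setProperty(
                IPC_CLIENT_CONNECT_MAX_RETRIES_KEY, Integer.toString(gracefulFenceRetries));
    }

    /** Retry count that proxies to the local target would see. */
    int retriesForLocalTarget(int fallback) {
        return readInt(conf, fallback);
    }

    /** Retry count that the graceful-fence proxy to the remote target would see. */
    int retriesForGracefulFence(int fallback) {
        return readInt(gracefulFenceConf, fallback);
    }

    private static int readInt(Properties source, int fallback) {
        String raw = source.getProperty(IPC_CLIENT_CONNECT_MAX_RETRIES_KEY);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }
}
```

The copy costs one extra object per controller instance, in exchange for not needing an extra field or method to carry the retry count around.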





[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-07-10 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13410218#comment-13410218
 ] 

Vinay commented on HDFS-3561:
-

Thanks Aaron for the suggestion.

I have one question here.
Shall we set the connection retries to 1 for all proxy connections made by the 
FailoverController, i.e. for the local target as well as the remote target?
Connecting to the local target should not take much time anyway.

What is your suggestion?





[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-07-10 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13410903#comment-13410903
 ] 

Aaron T. Myers commented on HDFS-3561:
--

I'd think that we'd only want the lower number of retries when trying to 
connect to the node that may be down.





[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-07-09 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13409927#comment-13409927
 ] 

Aaron T. Myers commented on HDFS-3561:
--

Seems to me like these new configs should not be made specific to the ZKFC, but 
rather should apply to all failover controllers. Given that, I think we should 
change the config keys to be named similarly to the other FC graceful 
connection configs, e.g. 
ha.failover-controller.graceful-fence.rpc-timeout.ms. Furthermore, we should 
push down the handling for this into the FailoverController, and not put it in 
ZKFailoverController.
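
To illustrate the naming idea, here is a small sketch of a retries key that sits next to the existing graceful-fence timeout key and is read in one place. Only the rpc-timeout key comes from the comment above; the connection-retries key name, the default of 1, and the class itself are assumptions, and java.util.Properties again stands in for Hadoop's Configuration.

```java
import java.util.Properties;

/** Hypothetical key holder; not the actual Hadoop constants class. */
final class GracefulFenceKeysSketch {
    // Existing key quoted in the comment above.
    static final String RPC_TIMEOUT_KEY =
            "ha.failover-controller.graceful-fence.rpc-timeout.ms";
    // Assumed sibling key that follows the same naming pattern.
    static final String CONNECTION_RETRIES_KEY =
            "ha.failover-controller.graceful-fence.connection.retries";
    // Illustrative default only.
    static final int CONNECTION_RETRIES_DEFAULT = 1;

    /** Reads the retry count, falling back to the default when the key is unset. */
    static int connectionRetries(Properties conf) {
        String raw = conf.getProperty(CONNECTION_RETRIES_KEY);
        return raw == null ? CONNECTION_RETRIES_DEFAULT : Integer.parseInt(raw.trim());
    }
}
```

Keeping the lookup in the generic failover controller means any controller implementation picks up the same setting, not only the ZooKeeper-based one.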





[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-07-06 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13408255#comment-13408255
 ] 

Uma Maheswara Rao G commented on HDFS-3561:
---

How about renaming the configuration key like this:
HA_ZKFC_GRACEFUL_FENCE_MAX_RETRIES -> 
HA_ZKFC_GRACEFUL_FENCE_CONNECTION_RETRIES_KEY?

Also, would it be clearer to separate out the configurations (retries on 
connection failure and retries on socket timeout)? They are exposed as 
different configs in common.

@Aaron, could you please add your comments here?





[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-07-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13408385#comment-13408385
 ] 

Hadoop QA commented on HDFS-3561:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12535395/HDFS-3561.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/2750//console

This message is automatically generated.





[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-07-02 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13405151#comment-13405151
 ] 

Aaron T. Myers commented on HDFS-3561:
--

bq. How can we do shared storage fencing from ZKFC?

Many NFS filers have APIs for fencing all subsequent writes or reads from a 
given IP address. The ZKFC can be configured to run a script which initiates 
fencing via this method. See the dfs.ha.fencing.methods section on [this 
page|http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailability.html#Configuration_details].
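For illustration only, a configuration along these lines would try the stock sshfence method first and then fall through to a site-specific script. The script path is a placeholder and not something from this thread; whatever filer-side API call performs the actual fencing is left to that script.

```xml
<!-- hdfs-site.xml fragment; the script path below is hypothetical -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/path/to/filer-fence.sh)</value>
</property>
```

The methods are listed one per line and attempted in that order.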

bq. For example, suppose we have shared storage fencing at the writer level: 
whenever a new writer comes, the old writer is automatically no longer allowed.

I'm not sure what you mean by this.

{quote}
Still, my question remains. If we want to fence 100% from the ZKFC itself, why 
is writer-level fencing on the shared storage required?
How are you handling the network-down scenario in your clusters?
{quote}

The ZKFC is always responsible for triggering fencing, so it's not really that 
we fence 100% from ZKFC itself. Rather, the ZKFC is capable of triggering the 
fencing process for a given NN via other devices, e.g. the filer or PDU.

I hope this clears things up.






[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-07-02 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13405222#comment-13405222
 ] 

Uma Maheswara Rao G commented on HDFS-3561:
---

Hi Aaron, Thanks a lot.

{quote}
For example, suppose we have shared storage fencing at the writer level: 
whenever a new writer comes, the old writer is automatically no longer allowed.

I'm not sure what you mean by this.
{quote}
What I mean is this. Take the case of BookKeeper:
BookKeeper itself has fencing logic. Whenever we open a new ledger, it is 
fenced automatically, and other clients cannot proceed with further writes.
I hope similar logic will come to QJM as well, using writer-level epoch numbers.

Maybe we need to develop a fence method that uses these clients to do the 
fencing. The only thing I don't like is that I would need to place those client 
libraries in the ZKFC. (This step may not be required, since each NN's JM has 
to open a writer anyway, and that writer will fence automatically.) I hope you 
understand my point here.
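A minimal sketch of the writer-level epoch idea mentioned above, in Java. This is not the BookKeeper or QJM API; the class EpochFencedLog and its methods are hypothetical names used only to show how a newer writer can implicitly fence an older one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of writer-level fencing with epoch numbers.
// It is not the BookKeeper or QJM API.
class EpochFencedLog {
    private long highestEpoch = 0;
    private final List<String> entries = new ArrayList<>();

    // A new writer claims the log with an epoch higher than any seen so far.
    synchronized long openWriter() {
        highestEpoch++;
        return highestEpoch;
    }

    // A write carrying an older epoch than the highest one is rejected,
    // so the previous writer is fenced without being contacted.
    synchronized boolean append(long writerEpoch, String entry) {
        if (writerEpoch < highestEpoch) {
            return false;
        }
        entries.add(entry);
        return true;
    }

    synchronized int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        EpochFencedLog log = new EpochFencedLog();
        long oldWriter = log.openWriter();
        System.out.println(log.append(oldWriter, "edit-1")); // accepted
        long newWriter = log.openWriter();                    // fences oldWriter
        System.out.println(log.append(oldWriter, "edit-2")); // rejected
        System.out.println(log.append(newWriter, "edit-3")); // accepted
    }
}
```

In this sketch the old writer never has to be reached over the network, which is the property that matters in the network-down scenario.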


  







[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-07-02 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13405227#comment-13405227
 ] 

Aaron T. Myers commented on HDFS-3561:
--

Ah, yes. In the case of both BKJM and the QJM, fencing is effectively built 
into the write protocol, so there won't be any need for external fencing options.

Regardless, note that I don't object to the main point of this JIRA, i.e. that 
we should lower the number of retries when attempting to gracefully fence an 
NN. If someone wants to supply a patch for that, I'll be happy to review/commit 
it.
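To put a rough number on the cost of the current retry count, the following Java sketch takes two consecutive timestamps from the ZKFC log excerpt in the issue description and multiplies the interval by the 45 retries reported there. It assumes every attempt takes as long as the ones in the excerpt; the class and method names are made up for this example.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Illustrative only: estimates how long the connection retries delay failover,
// based on two consecutive "Retrying connect" timestamps from the log excerpt.
class FencingDelayEstimate {
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Time between two consecutive retry log lines.
    static Duration retryInterval(String earlier, String later) {
        return Duration.between(LocalDateTime.parse(earlier, LOG_TIME),
                                LocalDateTime.parse(later, LOG_TIME));
    }

    // Total wait if each of `retries` attempts takes `interval`.
    static Duration totalWait(Duration interval, int retries) {
        return interval.multipliedBy(retries);
    }

    public static void main(String[] args) {
        Duration interval = retryInterval("2012-06-21 17:46:14,378",
                                          "2012-06-21 17:46:35,378");
        Duration total = totalWait(interval, 45);
        System.out.println("interval: " + interval.getSeconds() + " s");
        System.out.println("45 retries: " + total.getSeconds() + " s");
    }
}
```

With a 21-second interval, 45 retries come to 945 seconds, i.e. 15 minutes 45 seconds before the graceful fencing attempt gives up.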






[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-07-02 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13405238#comment-13405238
 ] 

Uma Maheswara Rao G commented on HDFS-3561:
---

Thanks Aaron :-)

{quote}
Regardless, note that I don't object to the main point of this JIRA, i.e. that 
we should lower the number of retries when attempting to gracefully fence an 
NN. If someone wants to supply a patch for that, I'll be happy to review/commit 
it.
{quote}
Sure. I will ask Vinay to upload the patch we have.






[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-06-29 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13404285#comment-13404285
 ] 

Aaron T. Myers commented on HDFS-3561:
--

I think some wires are getting crossed here. Some clarifications:

* The ZKFC *always* performs the act of fencing, by executing the configured 
fencing methods.
* There are two fencing methods shipped out of the box: 1) RPC to the active NN 
to tell it to move to the standby state, 2) ssh to the active NN and `kill -9' 
the NN process.
* You can optionally configure more fencing methods, for example IP-based 
shared storage fencing, or IP-based STONITH via PDU fencing.
* The ZKFC proceeds to execute the various fencing methods in the order they're 
configured.
* One of the stated aims of the HA work was to favor data reliability over 
availability. So, if we can't guarantee the correctness of the data, we 
shouldn't cause a state transition.

Given all of the above, at least one of the fencing methods *must* succeed 
before the ZKFC can reasonably cause the standby to transition to active. 
Imagine a network failure wherein the ZKFCs can no longer reach the active NN, 
but can reach the standby. If we just try to ping the active NN for a while, 
without having successfully fenced it, and then transition the standby to 
active since we can't ping the previous active, then both NNs might be active 
simultaneously, write to the shared storage and corrupt the FS metadata. This 
isn't acceptable.

As I said previously, I'm very much in favor of lowering the number of graceful 
fencing retries to a reasonable value. Todd recommended 0 or 1, which sounds 
fine by me. What I'm not in favor of is changing the ZKFC to ever cause the 
standby to become active without *some* fencing method succeeding.
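The rule described above (try the configured fencing methods in order, and only transition the standby if one of them succeeds) can be sketched in Java as follows. This is not the actual ZKFC code; the class OrderedFencing, its methods, and the returned strings are hypothetical, and each fencing method is reduced to a function that reports success or failure.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not the ZKFC implementation: fencing methods are tried
// in configured order, and failover is allowed only if one reports success.
class OrderedFencing {
    // Returns true as soon as one method succeeds; false if all of them fail.
    static boolean fence(List<BooleanSupplier> methodsInConfiguredOrder) {
        for (BooleanSupplier method : methodsInConfiguredOrder) {
            if (method.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }

    // The standby may become active only after fencing has succeeded.
    static String decide(List<BooleanSupplier> methods) {
        return fence(methods) ? "TRANSITION_TO_ACTIVE" : "STAY_STANDBY";
    }

    public static void main(String[] args) {
        BooleanSupplier gracefulRpc = () -> false; // old active unreachable
        BooleanSupplier sshKill = () -> false;     // ssh cannot connect either
        BooleanSupplier pduPowerOff = () -> true;  // assumed extra method
        System.out.println(decide(Arrays.asList(gracefulRpc, sshKill)));
        System.out.println(decide(Arrays.asList(gracefulRpc, sshKill, pduPowerOff)));
    }
}
```

In the scenario from this issue only the first two methods exist and both fail, so the sketch stays in standby; adding a method that works without reaching machine1 over the cluster network is what lets the transition go ahead.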



[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-06-29 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13404354#comment-13404354
 ] 

Uma Maheswara Rao G commented on HDFS-3561:
---

Hi Aaron, Thanks a lot for the explanation.

I have a few questions.

{quote}
You can optionally configure more fencing methods, for example IP-based shared 
storage fencing
{quote}







[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-06-29 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13404358#comment-13404358
 ] 

Uma Maheswara Rao G commented on HDFS-3561:
---

Hi Aaron, Thanks a lot for the explanation.

I have a few questions. I think I am missing something here.

{quote}
You can optionally configure more fencing methods, for example IP-based 
shared storage fencing
{quote}
How can we do shared storage fencing from ZKFC? For example, suppose we have 
shared storage fencing at the writer level: whenever a new writer comes, the 
old writer is automatically no longer allowed.

{quote}
One of the stated aims of the HA work was to favor data reliability over 
availability. So, if we can't guarantee the correctness of the data, we 
shouldn't cause a state transition.
{quote}
Still, my question remains. If we want to fence 100% from the ZKFC itself, why 
is writer-level fencing on the shared storage required?

How are you handling the network-down scenario in your clusters?

Regards,
Uma








[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-06-28 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13402978#comment-13402978
 ] 

Vinay commented on HDFS-3561:
-

{quote}This isn't acceptable. The point of fencing is to ensure that if the 
previously-active NN returns from appearing to have been down, it doesn't start 
writing to the shared directory again while the new active is also writing to 
that directory.{quote}

If the shared storage provides good fencing, so that once it is fenced it no 
longer allows writes from the old writer, then this situation will never arise.

If the failover does not happen because fencing of the old active fails, does 
that mean the network-down case is not supported in the ZKFC?

I think transitioning the current standby to active in this situation would be 
the correct behavior.

Any thoughts on this?


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-06-28 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13402994#comment-13402994
 ] 

Uma Maheswara Rao G commented on HDFS-3561:
---

Yes, we have multiple levels of fencing.

The first fencing should happen at the ZKFC level. The other place where we 
have fencing is the shared storage level.
What if the fencing logic checks for IP reachability? If the IP is reachable 
and we are not able to fence, the fence has failed. If the IP itself is not 
reachable, there is no way it can kill the remote node. In this case we may 
need to give a chance to the next-level fencer (shared storage)?
If ZKFC itself ensures a 100% fence, then why is shared storage fencing 
required? Am I missing something here?
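
The reachability check proposed above can be written out as a small decision 
table. This is only a sketch of the idea under discussion, not existing ZKFC 
behavior: the class, enum and method names are hypothetical, and the two 
booleans stand in for a real reachability probe and a real sshfence attempt.

```java
// Hypothetical sketch of the proposed escalation rule; not Hadoop code.
final class FenceEscalation {
    enum Decision { FENCED, FENCE_FAILED, TRY_NEXT_LEVEL }

    // hostReachable: result of an assumed IP reachability probe.
    // killSucceeded: result of an assumed remote kill (e.g. sshfence) attempt.
    static Decision decide(boolean hostReachable, boolean killSucceeded) {
        if (!hostReachable) {
            // The remote node cannot be killed at all, so hand over to the
            // next-level fencer (shared storage) instead of failing outright.
            return Decision.TRY_NEXT_LEVEL;
        }
        // The host answered, so the outcome of the kill attempt is decisive.
        return killSucceeded ? Decision.FENCED : Decision.FENCE_FAILED;
    }

    public static void main(String[] args) {
        System.out.println(decide(false, false));
        System.out.println(decide(true, false));
        System.out.println(decide(true, true));
    }
}
```

Whether TRY_NEXT_LEVEL is safe depends entirely on the shared storage really 
rejecting writes from the old active once it is fenced.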





[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-06-25 Thread Vinay (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13400487#comment-13400487
 ] 

Vinay commented on HDFS-3561:
-

During the transition, the old active is fenced.

Before the configured fencing method is actually used, graceful fencing is 
tried first. ZKFC tries to get a proxy to the NameNode on the other machine. 
Since the network is down, it cannot get a connection, and it retries 45 
times, as configured by *ipc.client.connect.max.retries.on.timeouts*.
{code}LOG.info("Should fence: " + target);
boolean gracefulWorked = new FailoverController(conf,
    RequestSource.REQUEST_BY_ZKFC).tryGracefulFence(target);
if (gracefulWorked) {
  // It's possible that it's in standby but just about to go into active,
  // no? Is there some race here?
  LOG.info("Successfully transitioned " + target + " to standby " +
      "state without fencing");
  return;
}{code}

I think in the ZKFC case we can reduce the number of retries.

Any thoughts?
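
To put a number on the delay: the timestamps in the ZKFC log from the issue 
description are 21 seconds apart, so 45 attempts at that pace take roughly a 
quarter of an hour before the graceful fence gives up. A rough sketch of that 
arithmetic follows; the class and method names are made up for illustration, 
and only the two timestamps and the count of 45 come from this issue.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper; it only does arithmetic on the quoted log timestamps.
final class RetryBudget {
    // Timestamp layout used in the ZKFC log excerpt.
    private static final DateTimeFormatter LOG_TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Time between two consecutive "Retrying connect" log lines.
    static Duration interval(String earlier, String later) {
        return Duration.between(LocalDateTime.parse(earlier, LOG_TS),
                                LocalDateTime.parse(later, LOG_TS));
    }

    // Total wait if every attempt takes the same time.
    static Duration worstCase(Duration perAttempt, int attempts) {
        return perAttempt.multipliedBy(attempts);
    }

    public static void main(String[] args) {
        Duration step = interval("2012-06-21 17:46:14,378",
                                 "2012-06-21 17:46:35,378");
        Duration total = worstCase(step, 45);
        // prints "21 s per attempt, 15 min 45 s in total"
        System.out.println(step.getSeconds() + " s per attempt, "
            + total.toMinutes() + " min " + (total.getSeconds() % 60)
            + " s in total");
    }
}
```

With a single attempt the same estimate drops to about 21 seconds.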





[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-06-25 Thread Uma Maheswara Rao G (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13400504#comment-13400504
 ] 

Uma Maheswara Rao G commented on HDFS-3561:
---

I think we can set retries to 1 or 2, to avoid unnecessary actions on small 
network fluctuations? Or we can set it to 0, as we already set the same values 
in ConfiguredFailoverProxyProvider for failover clients.

{code}
  public static final String DFS_CLIENT_FAILOVER_CONNECTION_RETRIES_KEY =
      "dfs.client.failover.connection.retries";
  public static final int DFS_CLIENT_FAILOVER_CONNECTION_RETRIES_DEFAULT = 0;
  public static final String
      DFS_CLIENT_FAILOVER_CONNECTION_RETRIES_ON_SOCKET_TIMEOUTS_KEY =
      "dfs.client.failover.connection.retries.on.timeouts";
  public static final int
      DFS_CLIENT_FAILOVER_CONNECTION_RETRIES_ON_SOCKET_TIMEOUTS_DEFAULT = 0;
{code}






[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-06-25 Thread Aaron T. Myers (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13400833#comment-13400833
 ] 

Aaron T. Myers commented on HDFS-3561:
--

bq. Suggestion: If ZKFC is not able to reach other NN for specified time/no of 
retries it can consider that NN as dead and instruct the other NN to take over 
as active as there is no chance of the other NN (NN1) retaining its state as 
active after zk session timeout when its isolated from network

This isn't acceptable. The point of fencing is to ensure that if the 
previously-active NN returns from appearing to have been down, it doesn't start 
writing to the shared directory again while the new active is also writing to 
that directory.

bq. I think we can set retries to 1/2 for avoiding unnecessary actions on small 
nw fluctuations? or we can set it to 0 as we are already setting the same 
values in ConfiguredFailoverProxyProvider for failover clients.

We set it to 0 in ConfiguredFailoverProxyProvider because we want to try 
failing over immediately as the retry mechanism, instead of repeatedly trying 
to contact a machine that may in fact be completely down.

I agree, though, that setting it to a lower number than 45 makes sense in the 
case of the client in the ZKFC, and perhaps making it configurable separately.
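
One way to make it configurable separately is to copy the configuration for 
the graceful-fence client only and override the retry count in the copy. The 
sketch below uses java.util.Properties as a stand-in for the Hadoop 
Configuration class so that it runs on its own. The fence-specific key name 
and its default of 1 are assumptions for illustration; only 
ipc.client.connect.max.retries.on.timeouts is taken from this thread.

```java
import java.util.Properties;

// Hypothetical sketch; Properties stands in for Hadoop's Configuration.
final class GracefulFenceRetries {
    // Key named earlier in this thread as the source of the 45 retries.
    static final String GENERAL_KEY =
        "ipc.client.connect.max.retries.on.timeouts";
    // Assumed name for a separate, fence-only setting.
    static final String FENCE_KEY =
        "ha.failover-controller.graceful-fence.connection.retries";
    // Assumed default for the fence-only setting.
    static final int FENCE_DEFAULT = 1;

    // Returns a copy whose general retry count is replaced by the fence-only
    // value, so other clients using the base settings are unaffected.
    static Properties forGracefulFence(Properties base) {
        Properties copy = new Properties();
        copy.putAll(base);
        int retries = Integer.parseInt(
            base.getProperty(FENCE_KEY, String.valueOf(FENCE_DEFAULT)));
        copy.setProperty(GENERAL_KEY, String.valueOf(retries));
        return copy;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        base.setProperty(GENERAL_KEY, "45");
        Properties fence = forGracefulFence(base);
        System.out.println("general client retries: "
            + base.getProperty(GENERAL_KEY));
        System.out.println("graceful fence retries: "
            + fence.getProperty(GENERAL_KEY));
    }
}
```

Working on a copy keeps the lower value from leaking into other IPC clients 
that share the same base settings.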





[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

2012-06-25 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13400847#comment-13400847
 ] 

Todd Lipcon commented on HDFS-3561:
---

+1 for setting it to 0 or 1 for the graceful fence attempt.
