[jira] [Commented] (HDFS-3561) ZKFC retries for 45 times to connect to other NN during fencing when network between NNs broken and standby Nn will not take over as active

Aaron T. Myers (JIRA) Mon, 02 Jul 2012 10:35:29 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405151#comment-13405151
 ]


Aaron T. Myers commented on HDFS-3561:
--------------------------------------

bq. How can we do shared storage fencing from the ZKFC?

Many NFS filers have APIs for fencing all subsequent writes or reads from a 
given IP address. The ZKFC can be configured to run a script which initiates 
fencing via this method. See the "dfs.ha.fencing.methods" section on [this 
page|http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailability.html#Configuration_details].
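
As a rough illustration of how a list of fencing methods behaves, here is a minimal Python sketch. It is not the ZKFC implementation. It assumes, as the linked documentation describes, that the configured methods are attempted in order until one reports success, and that failover proceeds only if fencing succeeded. The names try_fence, failover, ssh_fence and filer_fence are hypothetical stand-ins, with filer_fence playing the role of a script that asks the filer to block the old active NN's IP address.

```python
from typing import Callable, List, Tuple

# A fencing method is modelled as (name, callable); the callable returns
# True when it believes the target has been fenced. All names here are
# hypothetical and exist only for this sketch.
FenceMethod = Tuple[str, Callable[[str], bool]]


def try_fence(target: str, methods: List[FenceMethod]) -> bool:
    """Try each method in order and stop at the first that succeeds."""
    for _name, method in methods:
        try:
            if method(target):
                return True
        except Exception:
            # A method that raises counts as a failed attempt;
            # fall through to the next configured method.
            continue
    return False


def failover(target: str, methods: List[FenceMethod]) -> str:
    """Return the state the standby ends up in after a failover attempt."""
    if not try_fence(target, methods):
        # No method succeeded, so the standby must not become active.
        return "standby"
    return "active"


def ssh_fence(host: str) -> bool:
    # Stand-in for an SSH-based method when the target host is unreachable.
    raise ConnectionError(f"cannot reach {host}")


def filer_fence(host: str) -> bool:
    # Stand-in for a script that tells the filer to reject I/O from `host`.
    return True


if __name__ == "__main__":
    print(failover("machine1", [("sshfence", ssh_fence)]))
    print(failover("machine1", [("sshfence", ssh_fence),
                                ("filer", filer_fence)]))
```

With only the SSH-style method configured and the old active unreachable, the sketch leaves the standby in standby state, which matches the behaviour reported in this issue. Adding a second method that does not depend on reaching the isolated machine lets the failover complete.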

bq. for example if we have shared storage fencing at writer level. Whenever new 
writer comes, automatically old writer will not be allowed.

I'm not sure what you mean by this.

{quote}
My question still remains: if we want to fence 100% from the ZKFC itself, why 
is writer-level fencing on the shared storage required?
How are you handling the network-down scenario in your clusters?
{quote}

The ZKFC is always responsible for triggering fencing, so it's not really that 
we "fence 100% from ZKFC itself." Rather, the ZKFC is capable of triggering the 
fencing process for a given NN via other devices, e.g. the filer or PDU.

I hope this clears things up.
                
> ZKFC retries for 45 times to connect to other NN during fencing when network 
> between NNs broken and standby Nn will not take over as active 
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3561
>                 URL: https://issues.apache.org/jira/browse/HDFS-3561
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover
>            Reporter: suja s
>            Assignee: Vinay
>
> Scenario:
> Active NN on machine1
> Standby NN on machine2
> Machine1 is isolated from the network (machine1 network cable unplugged)
> After the zk session timeout, the ZKFC on machine2 is notified that NN1 is 
> gone.
> The ZKFC tries to fail over NN2 to active.
> As part of this, during fencing it tries to connect to machine1 and kill NN1 
> (sshfence technique configured).
> The connection is retried 45 times (as it takes 
> ipc.client.connect.max.socket.retries).
> Even after that, the standby NN is not able to take over as active (because 
> of the fencing failure).
> Suggestion: If the ZKFC is not able to reach the other NN within a specified 
> time or number of retries, it can consider that NN dead and instruct the 
> standby NN to take over as active, as there is no chance of the other NN 
> (NN1) retaining its active state after the zk session timeout when it is 
> isolated from the network.
> From ZKFC log:
> {noformat}
> 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s).
> 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s).
> 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s).
> 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s).
> 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s).
> 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s).
> 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s).
> 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s).
> 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s).
> 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s).
> {noformat}
>  
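
To put the retry count above in perspective, the following Python sketch estimates how long the fencing attempt blocks. It takes the first four timestamps from the log excerpt and the retry count of 45 from the issue description. It assumes the spacing seen in the excerpt holds for all 45 attempts, which the log does not confirm. The function name retry_intervals is hypothetical.

```python
from datetime import datetime

# First four timestamps from the ZKFC log excerpt above.
LOG_TIMESTAMPS = [
    "2012-06-21 17:46:14,378",
    "2012-06-21 17:46:35,378",
    "2012-06-21 17:46:56,378",
    "2012-06-21 17:47:17,378",
]

# Retry count stated in the issue description.
MAX_RETRIES = 45


def retry_intervals(stamps):
    """Return the gap in seconds between consecutive log timestamps."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S,%f") for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


intervals = retry_intervals(LOG_TIMESTAMPS)
mean_interval = sum(intervals) / len(intervals)

# Assumption: every one of the 45 attempts takes about the same time.
estimated_total = MAX_RETRIES * mean_interval

if __name__ == "__main__":
    print(f"seconds per attempt: {mean_interval}")
    print(f"estimated total: {estimated_total} s "
          f"({estimated_total / 60} min)")
```

The sampled attempts are 21 seconds apart, so under that assumption 45 attempts add up to 945 seconds, or roughly 16 minutes, before the fencing method gives up.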

