[ 
https://issues.apache.org/jira/browse/HDFS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405151#comment-13405151
 ] 

Aaron T. Myers commented on HDFS-3561:
--------------------------------------

bq. How we can do shared storage fencing from ZKFC?

Many NFS filers have APIs for fencing all subsequent writes or reads from a 
given IP address. The ZKFC can be configured to run a script which initiates 
fencing via this method. See the "dfs.ha.fencing.methods" section on [this 
page|http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailability.html#Configuration_details].
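
As a rough illustration only, the relevant hdfs-site.xml entry might look 
something like the fragment below. The script path is hypothetical (you would 
write it against your filer's API), and you should check the page above for the 
exact variables passed to shell methods; $target_host is used here on the 
assumption that it names the host to be fenced. Methods are listed one per line 
and are attempted in order until one reports success.

```xml
<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- Try sshfence first; if it fails, fall back to a site-specific script.
       /path/to/filer-fence.sh is a placeholder, not a script Hadoop ships. -->
  <value>sshfence
shell(/path/to/filer-fence.sh $target_host)</value>
</property>
```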

bq. for example if we have shared storage fencing at writer level. Whenever new 
writer comes, automatically old writer will not be allowed.

I'm not sure what you mean by this.

{quote}
Still my question remains. If we want to fence 100% from ZKFC itself, why 
shared storage writer level fencing required?
How you are handling network down scenario in your clusters?
{quote}

The ZKFC is always responsible for triggering fencing, so it's not really that 
we "fence 100% from ZKFC itself." Rather, the ZKFC is capable of triggering the 
fencing process for a given NN via other devices, e.g. the filer or PDU.
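
To make the idea concrete, here is a minimal sketch in Python (not the actual 
ZKFC code; every name in it is made up for illustration). It assumes two 
fencing methods: one that must reach the old NN's host, and one that talks to 
a separate device such as a filer or PDU. The second can still succeed when 
the host itself is off the network.

```python
from typing import Callable, Sequence

# A fencing method takes the host to fence and returns True on success.
FenceMethod = Callable[[str], bool]


def fence(target_host: str, methods: Sequence[FenceMethod]) -> bool:
    """Try each method in order; stop at the first one that succeeds."""
    for method in methods:
        try:
            if method(target_host):
                return True
        except Exception:
            # A method that blows up counts as a failed attempt.
            continue
    # Nothing worked, so it is not safe to make the other NN active.
    return False


def ssh_fence(host: str) -> bool:
    # Hypothetical stand-in for an ssh-based method: it needs to reach the
    # host, so it fails when the host's network cable is unplugged.
    raise ConnectionError(f"cannot reach {host}")


def filer_fence(host: str) -> bool:
    # Hypothetical stand-in for a script that asks the filer (or a PDU) to
    # cut the host off; it does not need to contact the host itself.
    return True
```

With only ssh_fence configured, fence() returns False and no failover happens, 
which is the behaviour described in this issue. Adding filer_fence after it 
lets fencing succeed.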

I hope this clears things up.
                
> ZKFC retries for 45 times to connect to other NN during fencing when network 
> between NNs broken and standby Nn will not take over as active 
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-3561
>                 URL: https://issues.apache.org/jira/browse/HDFS-3561
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover
>            Reporter: suja s
>            Assignee: Vinay
>
> Scenario:
> Active NN on machine1
> Standby NN on machine2
> Machine1 is isolated from the network (machine1 network cable unplugged)
> After zk session timeout ZKFC at machine2 side gets notification that NN1 is 
> not there.
> ZKFC tries to failover NN2 as active.
> As part of this during fencing it tries to connect to machine1 and kill NN1. 
> (sshfence technique configured)
> This connection retry happens 45 times (as it takes 
> ipc.client.connect.max.socket.retries)
> Also after that standby NN is not able to take over as active (because of 
> fencing failure).
> Suggestion: If ZKFC is not able to reach the other NN within a specified time 
> or number of retries, it can consider that NN dead and instruct the standby 
> NN to take over as active, as there is no chance of the other NN (NN1) 
> retaining its state as active after the zk session timeout when it is 
> isolated from the network.
> From ZKFC log:
> {noformat}
> 2012-06-21 17:46:14,378 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 22 time(s).
> 2012-06-21 17:46:35,378 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 23 time(s).
> 2012-06-21 17:46:56,378 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 24 time(s).
> 2012-06-21 17:47:17,378 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 25 time(s).
> 2012-06-21 17:47:38,382 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 26 time(s).
> 2012-06-21 17:47:59,382 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 27 time(s).
> 2012-06-21 17:48:20,386 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 28 time(s).
> 2012-06-21 17:48:41,386 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 29 time(s).
> 2012-06-21 17:49:02,386 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 30 time(s).
> 2012-06-21 17:49:23,386 INFO org.apache.hadoop.ipc.Client: Retrying connect 
> to server: HOST-xx-xx-xx-102/xx.xx.xx.102:65110. Already tried 31 time(s).
> {noformat}
>  


        
