Correct, David.

sshfence does not handle network unavailability: it has to SSH into the old 
active NN's host, so it fails when that host is unreachable.

Since the JournalNodes ensure that only one NN can write, fencing of the old 
active is handled automatically. So configuring the fence method as 
shell(/bin/true) should be fine.

Regards,
Vinayakumar B.
From: david marion [mailto:dlmar...@hotmail.com]
Sent: 18 March 2014 20:53
To: user@hadoop.apache.org
Subject: RE: HA NN Failover question

Found this: 
http://grokbase.com/t/cloudera/cdh-user/12anhyr8ht/cdh4-failover-controllers

Then configured dfs.ha.fencing.methods to contain both sshfence and 
shell(/bin/true). Note that the docs for core-default.xml say that the value is 
a list. I tried a comma with no luck. Had to look in the src to find it's 
separated by a newline. Adding shell(/bin/true) allowed it to work successfully.
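
For anyone finding this later, the setting described above would look roughly 
like the sketch below. It assumes the property is set in hdfs-site.xml, as in 
the HDFS HA documentation; the private key path is a hypothetical placeholder, 
not a value from this thread.

```xml
<!-- hdfs-site.xml fragment: an illustrative sketch, not a verified config -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- One fencing method per line (newline-separated, not comma-separated).
       Methods are attempted in order until one reports success, so
       shell(/bin/true) acts as a fallback when sshfence cannot reach the host. -->
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <!-- Hypothetical path: point this at a key the ZKFC user can read. -->
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>
```

With this ordering sshfence is still tried first, and shell(/bin/true) only 
takes over when the old active cannot be reached over SSH.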
________________________________
From: dlmar...@hotmail.com<mailto:dlmar...@hotmail.com>
To: user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: RE: HA NN Failover question
Date: Tue, 18 Mar 2014 14:51:25 +0000
I think I found the issue. The ZKFC on the standby NN server tried, and failed, 
to connect to the standby NN when I shut down the network on the active NN 
server. I'm getting an exception from the HealthMonitor in the ZKFC log:

WARN org.apache.hadoop.ha.HealthMonitor: Transport-level exception try to 
monitor health of NameNode at <host>/<ip>:<port>.....
INFO org.apache.hadoop.ipc.Client: Retrying connect to server 
<host>/<ip>:<port>. Already tried 0 time(s); retry policy is .... (the default)

Is it significant that it thinks the address is host/ip, instead of just the 
host or the ip?
________________________________
From: azury...@gmail.com<mailto:azury...@gmail.com>
Subject: Re: HA NN Failover question
Date: Sat, 15 Mar 2014 11:35:20 +0800
To: user@hadoop.apache.org<mailto:user@hadoop.apache.org>
I suppose NN2 is the standby. Please check that ZKFC2 is alive before stopping 
the network on NN1.


On 15 March 2014, at 10:53, dlmarion 
<dlmar...@hotmail.com<mailto:dlmar...@hotmail.com>> wrote:
Apache Hadoop 2.3.0




-------- Original message --------
From: Azuryy
Date: 03/14/2014 10:45 PM (GMT-05:00)
To: user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: Re: HA NN Failover question
Which Hadoop version are you using?



On 15 March 2014, at 9:29, dlmarion 
<dlmar...@hotmail.com<mailto:dlmar...@hotmail.com>> wrote:
Server 1: NN1 and ZKFC1
Server 2: NN2 and ZKFC2
Server 3: Journal1 and ZK1
Server 4: Journal2 and ZK2
Server 5: Journal3 and ZK3
Server 6+: Datanode

All in the same rack. I would expect the ZKFC from the active name node server 
to lose its lock and the other ZKFC to tell the standby namenode that it should 
become active (I’m assuming that’s how it works).

- Dave

From: Juan Carlos [mailto:juc...@gmail.com]
Sent: Friday, March 14, 2014 9:12 PM
To: user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Subject: Re: HA NN Failover question

Hi Dave,
How many ZooKeeper servers do you have, and where are they?

Juan Carlos Fernández Rodríguez

On 15/03/2014, at 01:21, dlmarion 
<dlmar...@hotmail.com<mailto:dlmar...@hotmail.com>> escribió:
I was doing some testing with HA NN today. I set up two NNs with active failover 
(ZKFC) using sshfence. I tested that it's working on both NNs by doing ‘kill -9 
<pid>’ on the active NN. When I did this on the active node, the standby would 
become the active and everything seemed to work. Next, I logged onto the active 
NN and did a ‘service network stop’ to simulate a NIC/network failure. The 
standby did not become the active in this scenario. In fact, it remained in 
standby mode and complained in the log that it could not communicate with (what 
was) the active NN. I was unable to find anything relevant via searches in 
Google or Jira. Does anyone have experience successfully testing this? I’m 
hoping that it is just a configuration problem.

FWIW, when the network was restarted on the active NN, it failed over almost 
immediately.

Thanks,

Dave
