Jian Fang created HDFS-8693:
-------------------------------

             Summary: refreshNamenodes does not support adding a new standby to a running DN
                 Key: HDFS-8693
                 URL: https://issues.apache.org/jira/browse/HDFS-8693
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Jian Fang
            Priority: Critical


I tried to run the following command on a Hadoop 2.6.0 cluster with HA support 

$ hdfs dfsadmin -refreshNamenodes datanode-host:port

to refresh the list of name nodes on the data nodes after I replaced one name 
node with a new one, so that I would not need to restart the data nodes. 
However, I got the following error:

refreshNamenodes: HA does not currently support adding a new standby to a 
running DN. Please do a rolling restart of DNs to reconfigure the list of NNs.

I checked the 2.6.0 code and the error was thrown by the following code 
snippet, which led me to this JIRA.

void refreshNNList(ArrayList<InetSocketAddress> addrs) throws IOException {
  Set<InetSocketAddress> oldAddrs = Sets.newHashSet();
  for (BPServiceActor actor : bpServices) {
    oldAddrs.add(actor.getNNSocketAddress());
  }
  Set<InetSocketAddress> newAddrs = Sets.newHashSet(addrs);
  if (!Sets.symmetricDifference(oldAddrs, newAddrs).isEmpty()) {
    // Keep things simple for now -- we can implement this at a later date.
    throw new IOException(
        "HA does not currently support adding a new standby to a running DN. " +
        "Please do a rolling restart of DNs to reconfigure the list of NNs.");
  }
}

It looks like the refreshNamenodes command is an incomplete feature.
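
To illustrate why the check above trips as soon as one name node is replaced, 
here is a minimal, self-contained sketch. It is not Hadoop code: the class 
NnListDiff, its methods, and the host names are hypothetical, and it uses plain 
java.util sets instead of Guava. It only shows that swapping one address makes 
the symmetric difference non-empty, and that the difference can be split into 
"added" and "removed" addresses, which is the information a more complete 
implementation would presumably need.

```java
import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical illustration only; not part of Hadoop. */
public class NnListDiff {

    /** Addresses present in the new configuration but not in the old one. */
    public static Set<InetSocketAddress> added(Set<InetSocketAddress> oldAddrs,
                                               Set<InetSocketAddress> newAddrs) {
        Set<InetSocketAddress> result = new HashSet<>(newAddrs);
        result.removeAll(oldAddrs);
        return result;
    }

    /** Addresses present in the old configuration but not in the new one. */
    public static Set<InetSocketAddress> removed(Set<InetSocketAddress> oldAddrs,
                                                 Set<InetSocketAddress> newAddrs) {
        Set<InetSocketAddress> result = new HashSet<>(oldAddrs);
        result.removeAll(newAddrs);
        return result;
    }

    /** Mirrors the 2.6.0 condition: any difference at all is rejected. */
    public static boolean wouldReject(Set<InetSocketAddress> oldAddrs,
                                      Set<InetSocketAddress> newAddrs) {
        return !added(oldAddrs, newAddrs).isEmpty()
            || !removed(oldAddrs, newAddrs).isEmpty();
    }

    public static void main(String[] args) {
        // createUnresolved avoids DNS lookups; the host names are made up.
        InetSocketAddress nn1 = InetSocketAddress.createUnresolved("nn1.example.com", 8020);
        InetSocketAddress nn2 = InetSocketAddress.createUnresolved("nn2.example.com", 8020);
        InetSocketAddress nn3 = InetSocketAddress.createUnresolved("nn3.example.com", 8020);

        Set<InetSocketAddress> oldAddrs = new HashSet<>(Arrays.asList(nn1, nn2));
        // nn2 was replaced by nn3.
        Set<InetSocketAddress> newAddrs = new HashSet<>(Arrays.asList(nn1, nn3));

        System.out.println("rejected: " + wouldReject(oldAddrs, newAddrs));
        System.out.println("added:    " + added(oldAddrs, newAddrs));
        System.out.println("removed:  " + removed(oldAddrs, newAddrs));
    }
}
```

With an unchanged list wouldReject returns false, which matches the only case 
the 2.6.0 code accepts. Any replacement produces one added and one removed 
address, so it is always rejected.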

Unfortunately, being able to bring in a new name node on a replacement instance 
is critical for auto-provisioning a Hadoop cluster with HDFS HA support. Without 
it, the HA feature cannot really be used. I also observed that the new standby 
name node on the replacement instance can get stuck in safe mode because no 
data nodes check in with it. Even with a rolling restart, it may take quite 
some time to restart all data nodes in a big cluster, for example one with 4000 
data nodes. Moreover, restarting DNs is far too intrusive and is not a 
desirable operation in production. It also increases the chance of a double 
failure, because the standby name node is not really ready for a failover if 
the current active name node fails.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
