[ https://issues.apache.org/jira/browse/HDFS-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ajith S reassigned HDFS-8693:
-----------------------------

    Assignee: Ajith S

> refreshNamenodes does not support adding a new standby to a running DN
> ----------------------------------------------------------------------
>
>                 Key: HDFS-8693
>                 URL: https://issues.apache.org/jira/browse/HDFS-8693
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, ha
>    Affects Versions: 2.6.0
>            Reporter: Jian Fang
>            Assignee: Ajith S
>            Priority: Critical
>
> I ran the following command on a Hadoop 2.6.0 cluster with HA support
>
>   $ hdfs dfsadmin -refreshNamenodes datanode-host:port
>
> to refresh the name nodes on the data nodes after replacing one name node
> with a new one, so that I would not need to restart the data nodes.
> However, I got the following error:
>
>   refreshNamenodes: HA does not currently support adding a new standby to a
>   running DN. Please do a rolling restart of DNs to reconfigure the list of
>   NNs.
>
> I checked the 2.6.0 code and found that the error is thrown by the
> following code snippet, which led me to this JIRA.
>
>   void refreshNNList(ArrayList<InetSocketAddress> addrs) throws IOException {
>     Set<InetSocketAddress> oldAddrs = Sets.newHashSet();
>     for (BPServiceActor actor : bpServices) {
>       oldAddrs.add(actor.getNNSocketAddress());
>     }
>     Set<InetSocketAddress> newAddrs = Sets.newHashSet(addrs);
>
>     if (!Sets.symmetricDifference(oldAddrs, newAddrs).isEmpty()) {
>       // Keep things simple for now -- we can implement this at a later date.
>       throw new IOException(
>           "HA does not currently support adding a new standby to a running DN. " +
>           "Please do a rolling restart of DNs to reconfigure the list of NNs.");
>     }
>   }
>
> It looks like the refreshNamenodes command is an incomplete feature.
>
> Unfortunately, supporting a new name node on a replacement instance is
> critical for auto-provisioning a Hadoop cluster with HDFS HA support.
> Without this support, the HA feature cannot really be used.
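
For illustration only: the check in the refreshNNList snippet quoted above rejects any change, because the symmetric difference of the old and new address sets is non-empty whenever a name node is added or removed. A minimal sketch of how the two directions of that difference could be separated is shown here, using only java.util. The class NNListDiff, its methods, and the example host names are hypothetical and are not part of Hadoop. What a data node would then do with each set (for example, starting or stopping a BPServiceActor) is not covered by the report and is left out.

```java
import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not Hadoop code: splits a name node list change
// into the addresses that were added and the addresses that were removed.
public class NNListDiff {

    // Addresses present in newAddrs but absent from oldAddrs.
    static Set<InetSocketAddress> added(Set<InetSocketAddress> oldAddrs,
                                        Set<InetSocketAddress> newAddrs) {
        Set<InetSocketAddress> result = new HashSet<>(newAddrs);
        result.removeAll(oldAddrs);
        return result;
    }

    // Addresses present in oldAddrs but absent from newAddrs.
    static Set<InetSocketAddress> removed(Set<InetSocketAddress> oldAddrs,
                                          Set<InetSocketAddress> newAddrs) {
        Set<InetSocketAddress> result = new HashSet<>(oldAddrs);
        result.removeAll(newAddrs);
        return result;
    }

    public static void main(String[] args) {
        // createUnresolved avoids DNS lookups; the host names are made up.
        InetSocketAddress nn1 = InetSocketAddress.createUnresolved("nn1.example.com", 8020);
        InetSocketAddress nn2 = InetSocketAddress.createUnresolved("nn2.example.com", 8020);
        InetSocketAddress nn3 = InetSocketAddress.createUnresolved("nn3.example.com", 8020);

        // nn2 has been replaced by nn3 in the new configuration.
        Set<InetSocketAddress> oldAddrs = new HashSet<>(Arrays.asList(nn1, nn2));
        Set<InetSocketAddress> newAddrs = new HashSet<>(Arrays.asList(nn1, nn3));

        System.out.println("added:   " + added(oldAddrs, newAddrs));
        System.out.println("removed: " + removed(oldAddrs, newAddrs));
    }
}
```

In the replacement scenario described in this issue, the added set would hold the new standby and the removed set the name node it replaced; the union of the two is exactly the symmetric difference that the quoted code tests.
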
> I also observed that the new standby name node on the replacement instance
> can get stuck in safe mode because no data nodes check in with it. Even
> with a rolling restart, it may take quite some time to restart all data
> nodes in a big cluster, for example one with 4000 data nodes. Besides,
> restarting DNs is far too intrusive and is not a preferable operation in
> production. It also increases the chance of a double failure, because the
> standby name node is not really ready for a failover if the current active
> name node fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)