Jian Fang created HDFS-8693:
-------------------------------

             Summary: refreshNamenodes does not support adding a new standby to a running DN
                 Key: HDFS-8693
                 URL: https://issues.apache.org/jira/browse/HDFS-8693
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Jian Fang
            Priority: Critical

I tried to run the following command on a Hadoop 2.6.0 cluster with HA support:

    $ hdfs dfsadmin -refreshNamenodes datanode-host:port

The goal was to refresh the name nodes on the data nodes after I replaced one name node with a new one, so that I would not need to restart the data nodes. However, I got the following error:

    refreshNamenodes: HA does not currently support adding a new standby to a running DN. Please do a rolling restart of DNs to reconfigure the list of NNs.

I checked the 2.6.0 code and found that the error is thrown by the following code snippet, which led me to this JIRA.

    void refreshNNList(ArrayList<InetSocketAddress> addrs) throws IOException {
      Set<InetSocketAddress> oldAddrs = Sets.newHashSet();
      for (BPServiceActor actor : bpServices) {
        oldAddrs.add(actor.getNNSocketAddress());
      }
      Set<InetSocketAddress> newAddrs = Sets.newHashSet(addrs);

      if (!Sets.symmetricDifference(oldAddrs, newAddrs).isEmpty()) {
        // Keep things simple for now -- we can implement this at a later date.
        throw new IOException(
            "HA does not currently support adding a new standby to a running DN. " +
            "Please do a rolling restart of DNs to reconfigure the list of NNs.");
      }
    }

It looks like the refreshNamenodes command is an unfinished feature.

Unfortunately, bringing up a new name node on a replacement instance is critical for auto provisioning a Hadoop cluster with HDFS HA support. Without this support, the HA feature cannot really be used. I also observed that the new standby name node on the replacement instance can get stuck in safe mode because no data nodes check in with it.

Even with a rolling restart, it may take quite some time to restart all data nodes in a big cluster, for example one with 4000 data nodes. Moreover, restarting DNs is far too intrusive and is not a desirable operation in production. It also increases the chance of a double failure, because the standby name node is not really ready for a failover if the current active name node fails.

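To make the behavior of the check in refreshNNList concrete: the symmetric difference of the old and new address sets is non-empty whenever any name node was added or removed, so replacing one NN always trips the exception. The sketch below splits that difference into "added" and "removed" sets, which is roughly the information a fuller implementation would need. It is only an illustration under assumptions: NnListDiff and its method names are hypothetical and not part of Hadoop, the host names are made up, and it uses only java.util rather than Guava.

```java
import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; this is NOT a Hadoop class.
final class NnListDiff {
    final Set<InetSocketAddress> added;   // in the new list but not the old one
    final Set<InetSocketAddress> removed; // in the old list but not the new one

    private NnListDiff(Set<InetSocketAddress> added, Set<InetSocketAddress> removed) {
        this.added = added;
        this.removed = removed;
    }

    // Splits the symmetric difference of the two address sets into its two halves.
    static NnListDiff between(Set<InetSocketAddress> oldAddrs, List<InetSocketAddress> newAddrs) {
        Set<InetSocketAddress> added = new HashSet<>(newAddrs);
        added.removeAll(oldAddrs);
        Set<InetSocketAddress> removed = new HashSet<>(oldAddrs);
        removed.removeAll(newAddrs);
        return new NnListDiff(added, removed);
    }

    // Equivalent to Sets.symmetricDifference(oldAddrs, newAddrs).isEmpty().
    boolean isEmpty() {
        return added.isEmpty() && removed.isEmpty();
    }

    public static void main(String[] args) {
        // Unresolved addresses keep the example free of DNS lookups (made-up hosts).
        InetSocketAddress nn1 = InetSocketAddress.createUnresolved("nn1.example.com", 8020);
        InetSocketAddress nn2 = InetSocketAddress.createUnresolved("nn2.example.com", 8020);
        InetSocketAddress nn3 = InetSocketAddress.createUnresolved("nn3.example.com", 8020);

        Set<InetSocketAddress> oldAddrs = new HashSet<>(Arrays.asList(nn1, nn2));
        List<InetSocketAddress> newAddrs = Arrays.asList(nn1, nn3); // nn2 replaced by nn3

        NnListDiff diff = NnListDiff.between(oldAddrs, newAddrs);
        System.out.println("changed: " + !diff.isEmpty());
        System.out.println("added:   " + diff.added);
        System.out.println("removed: " + diff.removed);
    }
}
```

With a diff like this, a DN could in principle start an actor for each added address and stop the actor for each removed one instead of rejecting the request, though whether that is safe inside the block pool service is exactly the open question of this issue.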
-- This message was sent by Atlassian JIRA (v6.3.4#6332)