Stephen O'Donnell created HDFS-14579:
----------------------------------------

             Summary: In refreshNodes, avoid performing a DNS lookup while 
holding the write lock
                 Key: HDFS-14579
                 URL: https://issues.apache.org/jira/browse/HDFS-14579
             Project: Hadoop HDFS
          Issue Type: Improvement
    Affects Versions: 3.3.0
            Reporter: Stephen O'Donnell
            Assignee: Stephen O'Donnell


When refreshNodes is called on a large cluster, or on a cluster where DNS is 
performing poorly, the namenode can hang for a long time. This is because 
refreshNodes holds the global write lock while it runs. Most of the 
refreshNodes code is simple and therefore fast, but it performs a DNS lookup 
for each host in the cluster while the lock is held. 

Right now, the method looks like this:

{code}
  public void refreshNodes(final Configuration conf) throws IOException {
    refreshHostsReader(conf);
    namesystem.writeLock();
    try {
      refreshDatanodes();
      countSoftwareVersions();
    } finally {
      namesystem.writeUnlock();
    }
  }
{code}

The call to refreshHostsReader(conf) reads the new config file and does a DNS 
lookup on each entry, but the write lock is not held at that point. The main 
work then happens in refreshDatanodes(), which runs under the write lock:

{code}
  private void refreshDatanodes() {
    final Map<String, DatanodeDescriptor> copy;
    synchronized (this) {
      copy = new HashMap<>(datanodeMap);
    }
    for (DatanodeDescriptor node : copy.values()) {
      // Check if not include.
      if (!hostConfigManager.isIncluded(node)) {
        node.setDisallowed(true);
      } else {
        long maintenanceExpireTimeInMS =
            hostConfigManager.getMaintenanceExpirationTimeInMS(node);
        if (node.maintenanceNotExpired(maintenanceExpireTimeInMS)) {
          datanodeAdminManager.startMaintenance(
              node, maintenanceExpireTimeInMS);
        } else if (hostConfigManager.isExcluded(node)) {
          datanodeAdminManager.startDecommission(node);
        } else {
          datanodeAdminManager.stopMaintenance(node);
          datanodeAdminManager.stopDecommission(node);
        }
      }
      node.setUpgradeDomain(hostConfigManager.getUpgradeDomain(node));
    }
  }
{code}

The isIncluded(), isExcluded() and similar methods all call 
node.getResolvedAddress(), which does the DNS lookup. We could probably 
perform all the DNS lookups before taking the write lock, and then take the 
lock and process the nodes. That would also mean changing or overloading 
isIncluded() etc. to take the already resolved address rather than the 
datanode descriptor.

This change would not shorten the overall run time of the operation, but it 
would move the slow part out of the write lock and avoid blocking the 
namenode for the entire duration.
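
To make the proposal concrete, here is a rough, self-contained sketch of the 
two-phase idea: resolve every address first, then take the write lock and run 
only in-memory checks. It is not a patch. The class RefreshNodesSketch, the 
Node type, the injected resolver and the isIncluded(InetSocketAddress) 
overload are all hypothetical stand-ins, the lock is a plain 
ReentrantReadWriteLock rather than the namesystem lock, and the lock is taken 
inside refreshDatanodes() here only to keep the example short.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

/** Illustrative only: none of these types are the real HDFS classes. */
final class RefreshNodesSketch {

  /** Hypothetical stand-in for DatanodeDescriptor. */
  static final class Node {
    final String host;
    final int port;
    boolean disallowed;

    Node(String host, int port) {
      this.host = host;
      this.port = port;
    }
  }

  /** Stand-in for the namesystem's global lock. */
  final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, Node> datanodeMap = new HashMap<>();
  private final Set<InetSocketAddress> includes;
  private final Function<Node, InetSocketAddress> resolver;

  RefreshNodesSketch(Set<InetSocketAddress> includes,
      Function<Node, InetSocketAddress> resolver) {
    this.includes = includes;
    this.resolver = resolver;
  }

  synchronized void register(Node node) {
    datanodeMap.put(node.host + ":" + node.port, node);
  }

  /** Hypothetical overload that takes an already resolved address. */
  boolean isIncluded(InetSocketAddress address) {
    return includes.isEmpty() || includes.contains(address);
  }

  void refreshDatanodes() {
    final List<Node> copy;
    synchronized (this) {
      copy = new ArrayList<>(datanodeMap.values());
    }
    // Phase 1: the potentially slow lookups run without the write lock.
    final Map<Node, InetSocketAddress> resolved = new HashMap<>();
    for (Node node : copy) {
      resolved.put(node, resolver.apply(node));
    }
    // Phase 2: only in-memory checks run while the write lock is held.
    lock.writeLock().lock();
    try {
      for (Map.Entry<Node, InetSocketAddress> e : resolved.entrySet()) {
        e.getKey().disallowed = !isIncluded(e.getValue());
      }
    } finally {
      lock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    // IP literals are parsed locally, so this demo needs no DNS server.
    Set<InetSocketAddress> includes =
        Set.of(new InetSocketAddress("10.0.0.1", 9866));
    RefreshNodesSketch sketch = new RefreshNodesSketch(
        includes, n -> new InetSocketAddress(n.host, n.port));
    Node in = new Node("10.0.0.1", 9866);
    Node out = new Node("10.0.0.2", 9866);
    sketch.register(in);
    sketch.register(out);
    sketch.refreshDatanodes();
    System.out.println(in.disallowed + " " + out.disallowed);
  }
}
```

One thing a real change would need to consider: in this sketch the node list 
is snapshotted before the lock is taken, so a datanode that registers between 
the snapshot and the locked phase is not processed until the next refresh.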


