[ https://issues.apache.org/jira/browse/HDFS-14579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stephen O'Donnell updated HDFS-14579:
-------------------------------------
    Attachment: HDFS-14579.001.patch

> In refreshNodes, avoid performing a DNS lookup while holding the write lock
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-14579
>                 URL: https://issues.apache.org/jira/browse/HDFS-14579
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.3.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>         Attachments: HDFS-14579.001.patch
>
>
> When refreshNodes is called on a large cluster, or on a cluster where DNS is
> not performing well, it can cause the namenode to hang for a long time. This
> is because the refreshNodes operation holds the global write lock while it is
> running. Most of the refreshNodes code is simple and hence fast, but
> unfortunately it performs a DNS lookup for each host in the cluster while the
> lock is held.
> Right now, it calls:
> {code}
>   public void refreshNodes(final Configuration conf) throws IOException {
>     refreshHostsReader(conf);
>     namesystem.writeLock();
>     try {
>       refreshDatanodes();
>       countSoftwareVersions();
>     } finally {
>       namesystem.writeUnlock();
>     }
>   }
> {code}
> The line refreshHostsReader(conf); reads the new config file and does a DNS
> lookup on each entry. The write lock is not held at that point. The main work
> is then done, under the write lock, in refreshDatanodes():
> {code}
>   private void refreshDatanodes() {
>     final Map<String, DatanodeDescriptor> copy;
>     synchronized (this) {
>       copy = new HashMap<>(datanodeMap);
>     }
>     for (DatanodeDescriptor node : copy.values()) {
>       // Check if not include.
>       if (!hostConfigManager.isIncluded(node)) {
>         node.setDisallowed(true);
>       } else {
>         long maintenanceExpireTimeInMS =
>             hostConfigManager.getMaintenanceExpirationTimeInMS(node);
>         if (node.maintenanceNotExpired(maintenanceExpireTimeInMS)) {
>           datanodeAdminManager.startMaintenance(
>               node, maintenanceExpireTimeInMS);
>         } else if (hostConfigManager.isExcluded(node)) {
>           datanodeAdminManager.startDecommission(node);
>         } else {
>           datanodeAdminManager.stopMaintenance(node);
>           datanodeAdminManager.stopDecommission(node);
>         }
>       }
>       node.setUpgradeDomain(hostConfigManager.getUpgradeDomain(node));
>     }
>   }
> {code}
> The isIncluded() and isExcluded() methods all call node.getResolvedAddress(),
> which does the DNS lookup. We could probably change things to perform all the
> DNS lookups outside of the write lock, and then take the lock and process the
> nodes. We would also change or overload isIncluded() etc. to take the
> inetAddress rather than the datanode descriptor.
> It would not shorten the time the operation takes to run overall, but it
> would move the long-running part out of the write lock and avoid blocking the
> namenode for the entire time.
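
The proposal in the description (resolve every address first, then take the write lock and process the nodes using overloads that accept an already-resolved address) might look roughly like the sketch below. This is not the attached patch and does not use the real HDFS classes: `RefreshNodesSketch`, `Node`, the `resolver` parameter, the plain `ReentrantReadWriteLock` standing in for the namesystem lock, and the simplified include/exclude rules are all hypothetical stand-ins chosen to keep the example self-contained.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;

/** Hypothetical sketch: do the slow lookups first, then lock and apply. */
public class RefreshNodesSketch {

  /** Hypothetical stand-in for DatanodeDescriptor. */
  public static final class Node {
    public final String hostName;
    public boolean disallowed;
    public boolean decommissioning;

    public Node(String hostName) {
      this.hostName = hostName;
    }
  }

  // Stand-in for the namesystem's global write lock.
  public final ReentrantReadWriteLock namesystemLock =
      new ReentrantReadWriteLock();
  public final Map<String, Node> datanodeMap = new HashMap<>();
  public final Set<InetSocketAddress> includes = new HashSet<>();
  public final Set<InetSocketAddress> excludes = new HashSet<>();

  // Overloads that take a pre-resolved address, so no lookup happens here.
  // Assumption for this sketch: an empty include list admits every node.
  public boolean isIncluded(InetSocketAddress addr) {
    return includes.isEmpty() || includes.contains(addr);
  }

  public boolean isExcluded(InetSocketAddress addr) {
    return excludes.contains(addr);
  }

  /**
   * @param resolver performs the (potentially slow) address lookup for a
   *                 node; it is only ever called before the lock is taken.
   */
  public void refreshDatanodes(Function<Node, InetSocketAddress> resolver) {
    final List<Node> copy;
    synchronized (this) {
      copy = new ArrayList<>(datanodeMap.values());
    }

    // Phase 1: slow part, write lock NOT held.
    final Map<Node, InetSocketAddress> resolved = new HashMap<>();
    for (Node node : copy) {
      resolved.put(node, resolver.apply(node));
    }

    // Phase 2: fast part, write lock held, no lookups.
    namesystemLock.writeLock().lock();
    try {
      for (Map.Entry<Node, InetSocketAddress> e : resolved.entrySet()) {
        Node node = e.getKey();
        InetSocketAddress addr = e.getValue();
        if (!isIncluded(addr)) {
          node.disallowed = true;
        } else {
          node.decommissioning = isExcluded(addr);
        }
      }
    } finally {
      namesystemLock.writeLock().unlock();
    }
  }
}
```

As the description notes, the total running time is unchanged. The only difference is that the per-host lookups no longer run while other namenode operations are blocked on the lock. One trade-off of this shape is that the include/exclude decision is made on addresses resolved slightly before the lock was taken.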