[ https://issues.apache.org/jira/browse/HDFS-14579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866612#comment-16866612 ]
Kihwal Lee commented on HDFS-14579:
-----------------------------------

{{node.getResolvedAddress()}} does not cause an actual lookup. The ctor of {{InetSocketAddress}} calls {{InetAddress.getByName()}}, which only verifies the address format when what is passed in is the string representation of an IP address.

{code:java}
/**
 * Determines the IP address of a host, given the host's name.
 *
 * <p> The host name can either be a machine name, such as
 * "{@code java.sun.com}", or a textual representation of its
 * IP address. If a literal IP address is supplied, only the
 * validity of the address format is checked.
 * ...
 */
public static InetAddress getByName(String host)
{code}

> In refreshNodes, avoid performing a DNS lookup while holding the write lock
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-14579
>                 URL: https://issues.apache.org/jira/browse/HDFS-14579
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.3.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>         Attachments: HDFS-14579.001.patch
>
>
> When refreshNodes is called on a large cluster, or on a cluster where DNS is
> not performing well, it can cause the namenode to hang for a long time. This
> is because the refreshNodes operation holds the global write lock while it is
> running. Most of the refreshNodes code is simple and hence fast, but
> unfortunately it performs a DNS lookup for each host in the cluster while the
> lock is held.
> Right now, it calls:
> {code}
> public void refreshNodes(final Configuration conf) throws IOException {
>   refreshHostsReader(conf);
>   namesystem.writeLock();
>   try {
>     refreshDatanodes();
>     countSoftwareVersions();
>   } finally {
>     namesystem.writeUnlock();
>   }
> }
> {code}
> The line {{refreshHostsReader(conf);}} reads the new config file and does a
> DNS lookup on each entry; the write lock is not held here.
> Then the main work is done here:
> {code}
> private void refreshDatanodes() {
>   final Map<String, DatanodeDescriptor> copy;
>   synchronized (this) {
>     copy = new HashMap<>(datanodeMap);
>   }
>   for (DatanodeDescriptor node : copy.values()) {
>     // Check if not include.
>     if (!hostConfigManager.isIncluded(node)) {
>       node.setDisallowed(true);
>     } else {
>       long maintenanceExpireTimeInMS =
>           hostConfigManager.getMaintenanceExpirationTimeInMS(node);
>       if (node.maintenanceNotExpired(maintenanceExpireTimeInMS)) {
>         datanodeAdminManager.startMaintenance(
>             node, maintenanceExpireTimeInMS);
>       } else if (hostConfigManager.isExcluded(node)) {
>         datanodeAdminManager.startDecommission(node);
>       } else {
>         datanodeAdminManager.stopMaintenance(node);
>         datanodeAdminManager.stopDecommission(node);
>       }
>     }
>     node.setUpgradeDomain(hostConfigManager.getUpgradeDomain(node));
>   }
> }
> {code}
> All of the isIncluded() and isExcluded() methods call
> {{node.getResolvedAddress()}}, which does the DNS lookup. We could probably
> change things to perform all the DNS lookups outside of the write lock, and
> then take the lock and process the nodes. We could also change or overload
> isIncluded() etc. to take the InetAddress rather than the DatanodeDescriptor.
> It would not shorten the time the operation takes to run overall, but it
> would move the long-duration work out of the write lock and avoid blocking
> the namenode for the entire time.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
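To illustrate the point in the comment above: when {{InetAddress.getByName()}} is handed an IP literal, it only validates the format and returns without querying the resolver, so building an {{InetSocketAddress}} from a literal string does not block on DNS. A minimal, self-contained sketch (the address and port below are arbitrary, not taken from the issue):

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

public class IpLiteralDemo {
    public static void main(String[] args) throws UnknownHostException {
        // IP literal: getByName only checks the format, no resolver call.
        InetAddress addr = InetAddress.getByName("192.168.1.10");
        System.out.println(addr.getHostAddress()); // prints 192.168.1.10

        // The String ctor of InetSocketAddress goes through the same path,
        // so an IP literal yields a resolved address without touching DNS.
        InetSocketAddress sock = new InetSocketAddress("192.168.1.10", 9866);
        System.out.println(sock.isUnresolved()); // prints false
    }
}
```

A hostname string, by contrast, would trigger a real resolver call at the same points and may block, which is exactly the cost the issue wants moved outside the lock.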
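The restructuring the description proposes - do the slow lookups first, then hold the write lock only for the fast bookkeeping - can be sketched generically. This is an illustrative pattern only, not the HDFS-14579 patch: the host list, port, and lock are stand-ins for the datanode map, datanode port, and FSNamesystem lock.

```java
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ResolveOutsideLock {
    // Stand-in for the namesystem read/write lock.
    private static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public static void main(String[] args) {
        // Stand-in for the datanode map; IP literals avoid real DNS in this demo.
        List<String> hosts = List.of("10.0.0.1", "10.0.0.2", "10.0.0.3");

        // Phase 1: resolve every address with NO lock held (the slow part).
        Map<String, InetSocketAddress> resolved = new HashMap<>();
        for (String host : hosts) {
            resolved.put(host, new InetSocketAddress(host, 9866));
        }

        // Phase 2: take the write lock only for the fast include/exclude
        // style processing, consuming the pre-resolved addresses.
        lock.writeLock().lock();
        try {
            for (Map.Entry<String, InetSocketAddress> e : resolved.entrySet()) {
                System.out.println(e.getKey() + " resolved="
                    + !e.getValue().isUnresolved());
            }
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

The total wall-clock time is unchanged, but other namenode operations can proceed during phase 1, which is the improvement the issue is after.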