[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167985#comment-14167985 ]
zhihai xu commented on YARN-2641:
---------------------------------

Sorry, I didn't describe the second scenario clearly: we first need to kill the NM process and then call the refreshNodes CLI to put the node in the blacklist.

To make the refreshNodes CLI work correctly, we need to create a file, for example "exclude_host.txt", that contains the name of the node to decommission, and then set "yarn.resourcemanager.nodes.exclude-path" to that file.

See the code in TestResourceTrackerService.java for the configuration and the node list file:
{code}
conf.set(YarnConfiguration.RM_NODES_EXCLUDE_FILE_PATH, hostFile
    .getAbsolutePath());

private void writeToHostsFile(String... hosts) throws IOException {
  if (!hostFile.exists()) {
    TEMP_DIR.mkdirs();
    hostFile.createNewFile();
  }
  FileOutputStream fStream = null;
  try {
    fStream = new FileOutputStream(hostFile);
    for (int i = 0; i < hosts.length; i++) {
      fStream.write(hosts[i].getBytes());
      fStream.write("\n".getBytes());
    }
  } finally {
    if (fStream != null) {
      IOUtils.closeStream(fStream);
      fStream = null;
    }
  }
}
{code}

> improve node decommission latency in RM.
> ----------------------------------------
>
>                 Key: YARN-2641
>                 URL: https://issues.apache.org/jira/browse/YARN-2641
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.5.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments: YARN-2641.000.patch, YARN-2641.001.patch
>
>
> improve node decommission latency in RM.
> Currently, node decommission only happens after the RM receives a
> nodeHeartbeat from the Node Manager. The node heartbeat interval is
> configurable. The default value is 1 second.
> It would be better to do the decommission during RM refresh
> (NodesListManager) instead of during nodeHeartbeat (ResourceTrackerService).
> There is also a much more serious issue:
> After the RM is refreshed (refreshNodes), if the NM to be decommissioned is
> killed before it sends a heartbeat to the RM, the RMNode will never be
> decommissioned in the RM. The RMNode will only expire in the RM after
> "yarn.nm.liveness-monitor.expiry-interval-ms" (default value 10 minutes).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
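
For anyone reproducing the second scenario outside the unit test, the exclude-file step described in the comment above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name ExcludeFileSketch and the host name nm-host-1.example.com are hypothetical, and the sketch uses java.nio instead of the FileOutputStream/IOUtils helper from TestResourceTrackerService.java. It does not touch YarnConfiguration; it only writes the file and prints the property line that would point the RM at it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper class, not part of Hadoop.
public class ExcludeFileSketch {
    // Property name quoted in the comment above.
    static final String EXCLUDE_PATH_KEY = "yarn.resourcemanager.nodes.exclude-path";

    /** Writes one host name per line, replacing any previous content. */
    static void writeExcludeFile(Path file, String... hosts) throws IOException {
        Path parent = file.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent); // no-op if the directory already exists
        }
        // Default open options create the file or truncate an existing one.
        Files.write(file, Arrays.asList(hosts), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("exclude_host.txt");
        // Placeholder host name; use the name of the NM to decommission.
        writeExcludeFile(file, "nm-host-1.example.com");
        // The value the RM configuration would need for this property.
        System.out.println(EXCLUDE_PATH_KEY + "=" + file.toAbsolutePath());
    }
}
```

With the file in place and the property set in the RM configuration, the refresh itself would then be triggered with `yarn rmadmin -refreshNodes` after the NM process has been killed.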