[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277895#comment-16277895 ]
Robert Kanter commented on YARN-6483:
-------------------------------------

[~asuresh], did you mean to commit this to branch-3.0? The fix version for this JIRA says 3.1.0. Also, the {{TestResourceTrackerService#testGracefulDecommissionDefaultTimeoutResolution}} added here relies on an XML excludes file, which is currently only supported in trunk (YARN-7162), so it fails when run in branch-3.0 because each line of the XML file is read as a separate host (e.g. {{<?xml}}, {{<name>host1</name>}}, etc.):

{noformat}
Running org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService
Tests run: 35, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 52.706 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService
testGracefulDecommissionDefaultTimeoutResolution(org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService)  Time elapsed: 23.913 sec  <<< FAILURE!
java.lang.AssertionError: Node state is not correct (timedout) expected:<DECOMMISSIONING> but was:<RUNNING>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:908)
	at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testGracefulDecommissionDefaultTimeoutResolution(TestResourceTrackerService.java:345)
{noformat}

> Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes
> returned to the AM
> ------------------------------------------------------------------------------------------------
>
>                 Key: YARN-6483
>                 URL: https://issues.apache.org/jira/browse/YARN-6483
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Juan Rodríguez Hortalá
>            Assignee: Juan Rodríguez Hortalá
>             Fix For: 3.1.0
>
>         Attachments: YARN-6483-v1.patch, YARN-6483.002.patch,
>                      YARN-6483.003.patch
>
>
> The DECOMMISSIONING node state is currently
> used as part of the graceful decommissioning mechanism to give time for tasks
> to complete on a node that is scheduled for decommission, and for reducer
> tasks to read the shuffle blocks on that node. YARN also effectively
> blacklists nodes in DECOMMISSIONING state by assigning them a capacity of 0,
> to prevent additional containers from being launched on those nodes, so no
> more shuffle blocks are written to the node. This blacklisting is not
> effective for applications like Spark, because a Spark executor running in a
> YARN container will keep receiving more tasks after the corresponding node
> has been blacklisted at the YARN level. We would like to propose a
> modification of the YARN heartbeat mechanism so that nodes transitioning to
> DECOMMISSIONING are added to the list of updated nodes returned by the
> Resource Manager in response to the Application Master heartbeat. This way a
> Spark application master would be able to blacklist a DECOMMISSIONING node
> at the Spark level.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
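The AM-side use of the proposal above (react to nodes that show up in the updated-nodes list with a DECOMMISSIONING state) can be sketched as follows. This is a minimal, self-contained model, not actual YARN API code: the {{NodeReport}}, {{NodeState}}, and {{hostsToBlacklist}} names below are simplified stand-ins for the real {{org.apache.hadoop.yarn.api.records}} types and for whatever blacklist handling a Spark application master would actually perform.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for YARN's NodeState; only the states relevant here.
enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

// Simplified stand-in for the node reports an AM would receive in the
// updated-nodes list of an allocate (heartbeat) response.
class NodeReport {
    final String host;
    final NodeState state;
    NodeReport(String host, NodeState state) {
        this.host = host;
        this.state = state;
    }
}

public class DecommissioningFilter {
    // Collect hosts the AM should stop scheduling on. With this JIRA, nodes
    // entering DECOMMISSIONING appear in the updated-nodes list, so the AM
    // can blacklist them before they reach DECOMMISSIONED.
    static List<String> hostsToBlacklist(List<NodeReport> updatedNodes) {
        List<String> result = new ArrayList<>();
        for (NodeReport report : updatedNodes) {
            if (report.state == NodeState.DECOMMISSIONING) {
                result.add(report.host);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<NodeReport> updated = Arrays.asList(
            new NodeReport("host1", NodeState.DECOMMISSIONING),
            new NodeReport("host2", NodeState.RUNNING));
        System.out.println(hostsToBlacklist(updated)); // prints [host1]
    }
}
```

In a real Spark AM the returned hosts would be fed into the task scheduler's blacklist so no new tasks are dispatched to executors on those nodes, mirroring the capacity-0 blacklisting YARN already does internally.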