[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263334#comment-16263334 ] Juan Rodríguez Hortalá commented on YARN-6483: -- Thanks again [~asuresh]. I have tried to reproduce TestNodeLabelContainerAllocation in my local environment, and the test fails only intermittently. I have found [YARN-7507|https://issues.apache.org/jira/browse/YARN-7507], that is still unresolved, and that reports that the same tests methods testPreferenceOfNeedyPrioritiesUnderSameAppTowardsNodePartitions, testPreferenceOfQueuesTowardsNodePartitions, testQueueMetricsWithLabels and testQueueMetricsWithLabelsOnDefaultLabelNode for TestNodeLabelContainerAllocation are failing in trunk. So it looks like that test failure is not releated to this patch. > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Juan Rodríguez Hortalá >Assignee: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch, YARN-6483.002.patch, > YARN-6483.003.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262983#comment-16262983 ] Arun Suresh commented on YARN-6483: --- Thanks [~juanrh]... The patch LGTM - I like the testcases as well. The findbugs warning is extant and I believe is being tracked in another JIRA. I can fix the remaining checkstyles when committing. Can you verify if the lone TestCase failure is related to the patch ? +1 pending the above. > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Juan Rodríguez Hortalá >Assignee: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch, YARN-6483.002.patch, > YARN-6483.003.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16262039#comment-16262039 ] Hadoop QA commented on YARN-6483: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 46s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 10 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 20s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 32s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 5s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 20s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 56s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 10s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api in trunk has 1 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 42s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 10s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 18s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 6m 18s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 6m 18s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 4s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 5 new + 763 unchanged - 20 fixed = 768 total (was 783) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 8s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 7 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 11s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 36s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 41s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 15s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 11s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 60m 48s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 22m 7s{color} | {color:red} hadoop-yarn-client in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 29s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}169m 7s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests |
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261927#comment-16261927 ] Juan Rodríguez Hortalá commented on YARN-6483: -- Hi [~asuresh], I have added a new patch, and also updated the pull request for easy visualization. I have: - provided default implementations for the new methods of NodeReport - reverted the changes to RMNodeDecommissioningEvent to use an Integer again - defined a new enum `org.apache.hadoop.yarn.api.records.NodeUpdateType` that is used as a new optional field for NodeReport, that is only set to a non null value for NodeReports corresponding to node transitions, and it's null for node reports not associated to node transitions like those requested by ClientRMService. I have added assertions for the former case to TestAMRMRPCNodeUpdates, and for the latter to TestClientRMService Thanks again for taking a look! > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Juan Rodríguez Hortalá >Assignee: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch, YARN-6483.002.patch, > YARN-6483.003.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259749#comment-16259749 ] Arun Suresh commented on YARN-6483: --- bq. by "update type" you mean adding an additional field with type `RMAppNodeUpdateType` to `NodeReport`? Yup - but not necessarily the 'RMAppNodeUpdateType' object itself. We should add another public facing enum in the proto file that maps to values in RMAppNodeUPdateType enum and add that as a field in the NodeReport. > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Juan Rodríguez Hortalá >Assignee: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch, YARN-6483.002.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16259564#comment-16259564 ] Juan Rodríguez Hortalá commented on YARN-6483: -- Hi [~asuresh], Thanks a lot for all your comments, I'll send a new patch attending all your suggestions. I'd just need a small clarification, by "update type" you mean adding an additional field with type `RMAppNodeUpdateType` to `NodeReport`? Thanks, Juan > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Juan Rodríguez Hortalá >Assignee: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch, YARN-6483.002.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16258890#comment-16258890 ] Arun Suresh commented on YARN-6483: --- Thanks for working on this [~juanrh], Apart from the above checkstyle and findbugs warnings that would need to be fixed, Couple of comments. * The NodeReport is unfortunately marked as Public and Stable. This means that older versions of the client should not have to be re-compiled even if they use the newer version of the class - which will happen now, since the new setter and getter are abstract. As a work around, what we do now is have default/no-op implementations in the NodeReport class and override them in the PBImpl class. * Given that we are taking the trouble of notifying the AM now of Decommissioned / decommisioning nodes. Maybe we should include the update type as well in the NodeReport ? * Looks like the only change in {{RMNodeDecommissioningEvent}} is from Integer to int. Can we revert this ? > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 3.1.0 >Reporter: Juan Rodríguez Hortalá >Assignee: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch, YARN-6483.002.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16258614#comment-16258614 ] Hadoop QA commented on YARN-6483: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 15m 47s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 8 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 57s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 8s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 6s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 23s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 11s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 11s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api in trunk has 1 extant Findbugs warnings. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 7s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager in trunk has 1 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 45s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 11s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 27s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 17s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} cc {color} | {color:green} 6m 17s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 6m 17s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 4s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 33 new + 606 unchanged - 3 fixed = 639 total (was 609) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 3m 9s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 1s{color} | {color:red} The patch has 32 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 6s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 15s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 29s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 38s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 2m 53s{color} | {color:red} hadoop-yarn-common in the patch failed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 2m 4s{color} | {color:green} hadoop-yarn-server-common in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 60m 14s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 20m 58s{color} | {color:green} hadoop-yarn-client in the
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257452#comment-16257452 ] ASF GitHub Bot commented on YARN-6483: -- Github user juanrh commented on a diff in the pull request: https://github.com/apache/hadoop/pull/289#discussion_r151770833 --- Diff: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java --- @@ -1160,6 +1160,11 @@ public void transition(RMNodeImpl rmNode, RMNodeEvent event) { // Update NM metrics during graceful decommissioning. rmNode.updateMetricsForGracefulDecommission(initState, finalState); rmNode.decommissioningTimeout = timeout; + // Notify NodesListManager to notify all RMApp so that each Application Master + // could take any required actions. + rmNode.context.getDispatcher().getEventHandler().handle( + new NodesListManagerEvent( + NodesListManagerEventType.NODE_USABLE, rmNode)); --- End diff -- I have just attached the patch. Thanks a lot for taking a look! > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch, YARN-6483.002.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257426#comment-16257426 ] ASF GitHub Bot commented on YARN-6483: -- Github user xslogic commented on a diff in the pull request: https://github.com/apache/hadoop/pull/289#discussion_r151767367 --- Diff: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java --- @@ -1160,6 +1160,11 @@ public void transition(RMNodeImpl rmNode, RMNodeEvent event) { // Update NM metrics during graceful decommissioning. rmNode.updateMetricsForGracefulDecommission(initState, finalState); rmNode.decommissioningTimeout = timeout; + // Notify NodesListManager to notify all RMApp so that each Application Master + // could take any required actions. + rmNode.context.getDispatcher().getEventHandler().handle( + new NodesListManagerEvent( + NodesListManagerEventType.NODE_USABLE, rmNode)); --- End diff -- Thanks - looking at the patch.. can you also attach a consolidated patch on the JIRA ? So as to kick Jenkins. > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257393#comment-16257393 ] ASF GitHub Bot commented on YARN-6483: -- Github user juanrh commented on a diff in the pull request: https://github.com/apache/hadoop/pull/289#discussion_r151762680 --- Diff: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java --- @@ -1160,6 +1160,11 @@ public void transition(RMNodeImpl rmNode, RMNodeEvent event) { // Update NM metrics during graceful decommissioning. rmNode.updateMetricsForGracefulDecommission(initState, finalState); rmNode.decommissioningTimeout = timeout; + // Notify NodesListManager to notify all RMApp so that each Application Master + // could take any required actions. + rmNode.context.getDispatcher().getEventHandler().handle( + new NodesListManagerEvent( + NodesListManagerEventType.NODE_USABLE, rmNode)); --- End diff -- Makes sense, I have added a new commit for adding a new value `NODE_DECOMMISSIONING` for `NodesListManagerEventType` > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16257379#comment-16257379 ] Brad commented on YARN-6483: I think this would would be useful for another spark change I am working on, SPARK-21097. Looking forward to see how this turns out. > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256061#comment-16256061 ] ASF GitHub Bot commented on YARN-6483: -- Github user xslogic commented on a diff in the pull request: https://github.com/apache/hadoop/pull/289#discussion_r151554769 --- Diff: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/NodeReport.java --- @@ -53,15 +53,15 @@ public static NodeReport newInstance(NodeId nodeId, NodeState nodeState, String httpAddress, String rackName, Resource used, Resource capability, int numContainers, String healthReport, long lastHealthReportTime) { return newInstance(nodeId, nodeState, httpAddress, rackName, used, -capability, numContainers, healthReport, lastHealthReportTime, null); +capability, numContainers, healthReport, lastHealthReportTime, null, null); } @Private @Unstable public static NodeReport newInstance(NodeId nodeId, NodeState nodeState, String httpAddress, String rackName, Resource used, Resource capability, int numContainers, String healthReport, long lastHealthReportTime, - Set nodeLabels) { + Set nodeLabels, Integer decommissioningTimeout) { --- End diff -- Yeah - sorry, just noticed that. maybe stick with Integer (although I dont like it much). Maybe we can raise another JIRA to fix it properly via your second suggestion. > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256045#comment-16256045 ] ASF GitHub Bot commented on YARN-6483: -- Github user xslogic commented on a diff in the pull request: https://github.com/apache/hadoop/pull/289#discussion_r151552105 --- Diff: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java --- @@ -1160,6 +1160,11 @@ public void transition(RMNodeImpl rmNode, RMNodeEvent event) { // Update NM metrics during graceful decommissioning. rmNode.updateMetricsForGracefulDecommission(initState, finalState); rmNode.decommissioningTimeout = timeout; + // Notify NodesListManager to notify all RMApp so that each Application Master + // could take any required actions. + rmNode.context.getDispatcher().getEventHandler().handle( + new NodesListManagerEvent( + NodesListManagerEventType.NODE_USABLE, rmNode)); --- End diff -- I feel we should make intentions explicit - having a separate event type might make the code cleaner and easier to follow - rather than overloading. It could be that the assumption in the testcase is wrong (will have to double check though), in which case - it is perfectly alright to modify the testcasse with the new event. > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254382#comment-16254382 ] ASF GitHub Bot commented on YARN-6483: -- Github user juanrh commented on a diff in the pull request: https://github.com/apache/hadoop/pull/289#discussion_r151276382 --- Diff: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java --- @@ -1160,6 +1160,11 @@ public void transition(RMNodeImpl rmNode, RMNodeEvent event) { // Update NM metrics during graceful decommissioning. rmNode.updateMetricsForGracefulDecommission(initState, finalState); rmNode.decommissioningTimeout = timeout; + // Notify NodesListManager to notify all RMApp so that each Application Master + // could take any required actions. + rmNode.context.getDispatcher().getEventHandler().handle( + new NodesListManagerEvent( + NodesListManagerEventType.NODE_USABLE, rmNode)); --- End diff -- I wasn't very sure about using `NODE_USABLE`, but while I was making the changes to follow your suggestion, I have found that in the current code [`TestRMNodeTransitions.testResourceUpdateOnDecommissioningNode`](https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java#L994) is asserting that `NodesListManagerEventType.NODE_USABLE` is expected for a node that transitions to `DECOMMISSIONING`. Also, NodesListManagerEventType is transformed into the corresponding `RMAppNodeUpdateType` in `NodesListManager.handle` to build a `RMAppNodeUpdateEvent` that is processed in `RMAppImpl.processNodeUpdate` which just uses the `RMAppNodeUpdateType` for logging. So it looks like it is ok to use `NodesListManagerEventType.NODE_USABLE` for nodes in decommissioning state. Do you still think it's worth adding some additional value for `NodesListManagerEventType` and `RMAppNodeUpdateType`? > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254274#comment-16254274 ] ASF GitHub Bot commented on YARN-6483: -- Github user juanrh commented on a diff in the pull request: https://github.com/apache/hadoop/pull/289#discussion_r151262546 --- Diff: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/NodeReport.java --- @@ -53,15 +53,15 @@ public static NodeReport newInstance(NodeId nodeId, NodeState nodeState, String httpAddress, String rackName, Resource used, Resource capability, int numContainers, String healthReport, long lastHealthReportTime) { return newInstance(nodeId, nodeState, httpAddress, rackName, used, -capability, numContainers, healthReport, lastHealthReportTime, null); +capability, numContainers, healthReport, lastHealthReportTime, null, null); } @Private @Unstable public static NodeReport newInstance(NodeId nodeId, NodeState nodeState, String httpAddress, String rackName, Resource used, Resource capability, int numContainers, String healthReport, long lastHealthReportTime, - Set nodeLabels) { + Set nodeLabels, Integer decommissioningTimeout) { --- End diff -- This `Integer` comes from [`RMNode.getDecommissioningTimeout()`](https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNode.java#L189) that was already returning an Integer before this patch, because only nodes in DECOMMISSIONING state have an associated decommission timeout, so `null` is used to express absent timeout. In this patch `RMNode.getDecommissioningTimeout()` is used in `DefaultAMSProcessor.handleNodeUpdates` to get the argument `decommissioningTimeout` for `BuilderUtils.newNodeReport`. If we use a `int` for `decommissioningTimeout` in `NodeReport.newInstance` I think we should also use an `int` for the same argument in `BuilderUtils.newNodeReport` for uniformity, which implies a conversion from `null` to `-1` in `DefaultAMSProcessor.handleNodeUpdates`. So I think we should either keep using `Integer decommissioningTimeout` everywhere, enconding absent timeout with `null`, or use `int decommissioningTimeout` everywhere, enconding absent timeout with a negative timeout, which is coherent with `message NodeReportProto` using an unsigned int for `decommissioning_timeout`. What do you think about these 2 alternatives? > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254273#comment-16254273 ] Juan Rodríguez Hortalá commented on YARN-6483: -- Thanks a lot for the comments [~asuresh], I'll answer in the pull request > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254131#comment-16254131 ] ASF GitHub Bot commented on YARN-6483: -- Github user xslogic commented on a diff in the pull request: https://github.com/apache/hadoop/pull/289#discussion_r151244254 --- Diff: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java --- @@ -1160,6 +1160,11 @@ public void transition(RMNodeImpl rmNode, RMNodeEvent event) { // Update NM metrics during graceful decommissioning. rmNode.updateMetricsForGracefulDecommission(initState, finalState); rmNode.decommissioningTimeout = timeout; + // Notify NodesListManager to notify all RMApp so that each Application Master + // could take any required actions. + rmNode.context.getDispatcher().getEventHandler().handle( + new NodesListManagerEvent( + NodesListManagerEventType.NODE_USABLE, rmNode)); --- End diff -- Hmmm.. don't think we should be sending NODE_USABLE event here.. since technically, it is not usable. Maybe consider creating a new NODE_DECOMMISSIONING event ? > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254132#comment-16254132 ] Arun Suresh commented on YARN-6483: --- Left some comments on the pull request. > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254103#comment-16254103 ] ASF GitHub Bot commented on YARN-6483: -- Github user xslogic commented on a diff in the pull request: https://github.com/apache/hadoop/pull/289#discussion_r151241885 --- Diff: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/NodeReport.java --- @@ -53,15 +53,15 @@ public static NodeReport newInstance(NodeId nodeId, NodeState nodeState, String httpAddress, String rackName, Resource used, Resource capability, int numContainers, String healthReport, long lastHealthReportTime) { return newInstance(nodeId, nodeState, httpAddress, rackName, used, -capability, numContainers, healthReport, lastHealthReportTime, null); +capability, numContainers, healthReport, lastHealthReportTime, null, null); } @Private @Unstable public static NodeReport newInstance(NodeId nodeId, NodeState nodeState, String httpAddress, String rackName, Resource used, Resource capability, int numContainers, String healthReport, long lastHealthReportTime, - Set nodeLabels) { + Set nodeLabels, Integer decommissioningTimeout) { --- End diff -- Can we use int ? Am not confortable using Integer - given the sometimes arbitrary boxing/unboxing rules. > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16253912#comment-16253912 ] Juan Rodríguez Hortalá commented on YARN-6483: -- Hi, Any thoughts on this proposal? [~tgraves] do you think this could be interesting to trigger the blacklisting of executors in decommissioning nodes as in https://github.com/apache/spark/pull/19267/? Thanks, Juan > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242634#comment-16242634 ] ASF GitHub Bot commented on YARN-6483: -- GitHub user juanrh opened a pull request: https://github.com/apache/hadoop/pull/289 [YARN-6483] Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat This is an alternative approach to https://issues.apache.org/jira/browse/YARN-3224 for notifying all affected application masters when a node transitions into the DECOMMISSIONING state. This change modifies the AllocateResponse that the YARN Resource Manager uses to respond to heartbeat request from application masters, to add any node that has transitioned to DECOMMISSIONING state since the last heartbeat to the list of NodeReport objects that is part of the AllocateResponse object. We also add a new field to each NodeReport to add the decommission timeout for DECOMMISSIONING nodes, thus covering the same functionality of the original proposal in YARN-3224. You can merge this pull request into a Git repository by running: $ git pull https://github.com/juanrh/hadoop hortala-YARN-6483 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/hadoop/pull/289.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #289 commit c85b7012b90fff2cf323009664b14db1e998408a Author: Juan Rodriguez HortalaDate: 2017-10-27T23:59:57Z Notify affected Application Masters when a node enters DECOMMISSIONING state commit ed2561e3bfb821f6b981ac5e76761c7ae8475e01 Author: Juan Rodriguez Hortala Date: 2017-11-02T00:41:27Z Add decommission timeout field to NodeReport commit 506f7defbf217c6b8b525a90604e2484573bba8d Author: Juan Rodriguez Hortala Date: 2017-11-02T01:19:49Z fix TestClientRMService adapt test to decommission timeout checks being independent of received heartbeats commit 9a82f314feca6e06e067ceb1b64e0e4e4fad3882 Author: Juan Rodriguez Hortala Date: 2017-11-02T18:07:29Z use xml format for excludes files with timeouts in this version that is the only way to specify a timeout in the excludes file commit e17bc8c132ae3fb70fe273e6bee12956142b5345 Author: Juan Rodriguez Hortala Date: 2017-11-02T20:41:53Z load dynamic timeout like in hadoop trunk replace dynamic conf by using the configuration passed by AdminService cr https://cr.amazon.com/r/7919988/ > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997111#comment-15997111 ] Juan Rodríguez Hortalá commented on YARN-6483: -- Hi [~Naganarasimha], sorry for the late answer. I have attached a patch for commit 103dfae7b9bbcec673dd7c53d38df034b78693a2 that is the current head of branch-2.8 in https://github.com/apache/hadoop. However since I opened this issue I had the opportunity to study the umbrella YARN-914 for decommissioning and I see it already has a sub-task YARN-3224 for notifying the AM of the node decommission. The design proposed on YARN-914 uses the preemption framework described in the umbrella YARN-45. There has been not much activity lately in YARN-3224, and it seems to be blocked on the sub-task YARN-3784 of the preemption umbrella YARN-45. However my patch here is not able to send the decommission timeout, as it is based on `NodeState` that is an enum that doesn't have any field. I don't think it would be a good idea to add a timeout field to `NodeState` because it doesn't make sense for all enum values. So I think the options here are: # Either modify `NodeState` to replace the enum by a class hierarchy, so the subclass for decommissioning might have a timeout. This looks like a significant modification, that anyway diverges from the design of YARN-914 # Or implement the missing issues for the preemption framework and the decommissioning umbrella, in particular YARN-3784 and then YARN-3224, that would replace this issue Which one of those do you think it would be the best option? > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.8.0 >Reporter: Juan Rodríguez Hortalá > Attachments: YARN-6483-v1.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15972101#comment-15972101 ] Naganarasimha G R commented on YARN-6483: - Hi [~webxcreen], Seems like this information is required for any of the long running services, Would you please share a valid patch so that i can understand your approach ? > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.7.3 >Reporter: Juan Rodríguez Hortalá > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6483) Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned by the Resource Manager as a response to the Application Master heartbeat
[ https://issues.apache.org/jira/browse/YARN-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969326#comment-15969326 ] Hadoop QA commented on YARN-6483: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 13s{color} | {color:red} YARN-6483 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | YARN-6483 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12863488/YARN-6483.patch | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/15641/console | | Powered by | Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes > returned by the Resource Manager as a response to the Application Master > heartbeat > > > Key: YARN-6483 > URL: https://issues.apache.org/jira/browse/YARN-6483 > Project: Hadoop YARN > Issue Type: Improvement > Components: resourcemanager >Affects Versions: 2.7.3 >Reporter: Juan Rodríguez Hortalá > Attachments: YARN-6483.patch > > > The DECOMMISSIONING node state is currently used as part of the graceful > decommissioning mechanism to give time for tasks to complete in a node that > is scheduled for decommission, and for reducer tasks to read the shuffle > blocks in that node. Also, YARN effectively blacklists nodes in > DECOMMISSIONING state by assigning them a capacity of 0, to prevent > additional containers to be launched in those nodes, so no more shuffle > blocks are written to the node. This blacklisting is not effective for > applications like Spark, because a Spark executor running in a YARN container > will keep receiving more tasks after the corresponding node has been > blacklisted at the YARN level. We would like to propose a modification of the > YARN heartbeat mechanism so nodes transitioning to DECOMMISSIONING are added > to the list of updated nodes returned by the Resource Manager as a response > to the Application Master heartbeat. This way a Spark application master > would be able to blacklist a DECOMMISSIONING at the Spark level. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org