[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278244#comment-17278244 ]
Ahmed Hussein commented on YARN-10352:
--------------------------------------

The problem is that at any point we have more than one commit for each main Jira ticket. This makes it hard to move between revisions without breaking the build.
I suggest amending the fixes into the original commit and closing YARN-10611, that is, reverting and recommitting a patch that does not generate Yetus errors.
Please make sure that the patch passes Yetus before merging.

> Skip schedule on not heartbeated nodes in Multi Node Placement
> --------------------------------------------------------------
>
>                 Key: YARN-10352
>                 URL: https://issues.apache.org/jira/browse/YARN-10352
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>    Affects Versions: 3.3.0, 3.4.0
>            Reporter: Prabhu Joseph
>            Assignee: Prabhu Joseph
>            Priority: Major
>              Labels: capacityscheduler, multi-node-placement
>             Fix For: 3.4.0
>
>         Attachments: YARN-10352-001.patch, YARN-10352-002.patch,
> YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch,
> YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch,
> YARN-10352-010.patch, YARN-10352.009.patch
>
>
> When Node Recovery is enabled, stopping an NM does not unregister it from the RM. So the RM's
> active nodes still include those stopped nodes until the NM Liveliness
> Monitor expires them after the configured timeout
> (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During these 10 mins,
> Multi Node Placement assigns containers to those nodes. It needs to
> exclude the nodes which have not heartbeated for the configured heartbeat interval
> (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms), similar to
> the Asynchronous Capacity Scheduler Threads
> (CapacityScheduler#shouldSkipNodeSchedule).
> *Repro:*
> 1. Enable Multi Node Placement
> (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery
> Enabled (yarn.node.recovery.enabled)
> 2. Have only one NM running, say worker0
> 3. Stop worker0 and start any other NM, say worker1
> 4. Submit a sleep job.
> The containers will time out because they are assigned to the stopped NM worker0.
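
The exclusion described in the issue can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from any of the attached patches: NodeInfo, HeartbeatFilterSketch, and the missedIntervals tolerance are hypothetical names made up for this example, and the real check lives in CapacityScheduler#shouldSkipNodeSchedule. The idea is that a candidate node is dropped from placement once it has been silent for longer than a few heartbeat intervals.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a scheduler node; not a Hadoop class. */
final class NodeInfo {
    final String name;
    final long lastHeartbeatMs; // timestamp of the last heartbeat received

    NodeInfo(String name, long lastHeartbeatMs) {
        this.name = name;
        this.lastHeartbeatMs = lastHeartbeatMs;
    }
}

/** Hypothetical helper that filters placement candidates by heartbeat age. */
final class HeartbeatFilterSketch {
    private HeartbeatFilterSketch() {}

    /**
     * Returns true when the node has been silent for longer than
     * missedIntervals heartbeat intervals (the tolerance is an assumption).
     */
    static boolean shouldSkip(NodeInfo node, long nowMs,
                              long heartbeatIntervalMs, int missedIntervals) {
        return nowMs - node.lastHeartbeatMs > heartbeatIntervalMs * missedIntervals;
    }

    /** Keeps only the candidates that heartbeated recently enough. */
    static List<NodeInfo> schedulable(List<NodeInfo> candidates, long nowMs,
                                      long heartbeatIntervalMs, int missedIntervals) {
        List<NodeInfo> result = new ArrayList<>();
        for (NodeInfo node : candidates) {
            if (!shouldSkip(node, nowMs, heartbeatIntervalMs, missedIntervals)) {
                result.add(node);
            }
        }
        return result;
    }
}
```

In terms of the repro, a stopped worker0 would be filtered out soon after its last heartbeat, rather than staying eligible until the liveness monitor expires it, so only worker1 would remain a placement candidate.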