[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268387#comment-17268387 ]
zhuqi edited comment on YARN-10352 at 1/20/21, 6:16 AM: -------------------------------------------------------- [~bibinchundatt] I have rebased it in YARN-10352.009.patch . With some changes YARN-8557 which will skip not running node also. Thanks. was (Author: zhuqi): [~bibinchundatt] I have rebased it with some changes YARN-8557 which will skip not running node also. Thanks. > Skip schedule on not heartbeated nodes in Multi Node Placement > -------------------------------------------------------------- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Sub-task > Affects Versions: 3.3.0, 3.4.0 > Reporter: Prabhu Joseph > Assignee: Prabhu Joseph > Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch, > YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, > YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch, > YARN-10352.009.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org