[ https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105675#comment-17105675 ]
Wangda Tan commented on YARN-10259:
-----------------------------------

Reviewed the patch, it looks good to me. I think it may introduce a performance regression for large clusters, but I agree this is the right fix; otherwise we can see issues such as the scheduler getting stuck. Can we move this (and similar logs) to debug:
{code:java}
LOG.warn("Node : " + node.getNodeID()
    + " does not have sufficient resource for ask : " + pendingAsk
    + " node total capability : " + node.getTotalResource());
{code}
For a heterogeneous cluster we can see this message quite often, so logging it at warn level is overkill to me.

So +1 to the patch; please move some logs to debug so that the log volume does not increase too much after this change. (A sketch of the debug-level form is appended at the end of this message.)

> Reserved Containers not allocated from available space of other nodes in
> CandidateNodeSet in MultiNodePlacement
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10259
>                 URL: https://issues.apache.org/jira/browse/YARN-10259
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.2.0, 3.3.0
>            Reporter: Prabhu Joseph
>            Assignee: Prabhu Joseph
>            Priority: Major
>         Attachments: YARN-10259-001.patch, YARN-10259-002.patch
>
>
> Reserved Containers are not allocated from the available space of other nodes
> in CandidateNodeSet in MultiNodePlacement.
>
> *Repro:*
> 1. MultiNode Placement Enabled.
> 2. Two nodes h1 and h2 with 8GB each.
> 3. Submit app1 AM (5GB) which gets placed in h1 and app2 AM (5GB) which gets placed in h2.
> 4. Submit app3 AM, which gets reserved on h1.
> 5. Kill app2, which frees space in h2.
> 6. app3 AM never gets ALLOCATED.
>
> RM logs show the YARN-8127 fix rejecting the allocation proposal for app3 AM on h2, as it expects the assignment to be on the same node where the reservation has happened.
> {code}
> 2020-05-05 18:49:37,264 DEBUG [AsyncDispatcher event handler]
> scheduler.SchedulerApplicationAttempt
> (SchedulerApplicationAttempt.java:commonReserve(573)) - Application attempt
> appattempt_1588684773609_0003_000001 reserved container
> container_1588684773609_0003_01_000001 on node host: h1:1234 #containers=1
> available=<memory:3072, vCores:7> used=<memory:5120, vCores:1>. This attempt
> currently has 1 reserved containers at priority 0; currentReservation
> <memory:5120, vCores:1>
>
> 2020-05-05 18:49:37,264 INFO [AsyncDispatcher event handler]
> fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(670)) - Reserved
> container=container_1588684773609_0003_01_000001, on node=host: h1:1234
> #containers=1 available=<memory:3072, vCores:7> used=<memory:5120, vCores:1>
> with resource=<memory:5120, vCores:1>
> RESERVED=[(Application=appattempt_1588684773609_0003_000001;
> Node=h1:1234; Resource=<memory:5120, vCores:1>)]
>
> 2020-05-05 18:49:38,283 DEBUG [Time-limited test]
> allocator.RegularContainerAllocator
> (RegularContainerAllocator.java:assignContainer(514)) - assignContainers:
> node=h2 application=application_1588684773609_0003 priority=0
> pendingAsk=<per-allocation-resource=<memory:5120, vCores:1>,repeat=1>
> type=OFF_SWITCH
>
> 2020-05-05 18:49:38,285 DEBUG [Time-limited test] fica.FiCaSchedulerApp
> (FiCaSchedulerApp.java:commonCheckContainerAllocation(371)) - Try to allocate
> from reserved container container_1588684773609_0003_01_000001, but node is
> not reserved
> ALLOCATED=[(Application=appattempt_1588684773609_0003_000001;
> Node=h2:1234; Resource=<memory:5120, vCores:1>)]
> {code}
> After reverting the fix of YARN-8127, it works.
> Attached testcase which reproduces the issue.
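
For context, here is a hedged, simplified illustration of the kind of check described by the quoted "but node is not reserved" DEBUG message above. This is not the actual FiCaSchedulerApp code; the class, method, and parameter names are illustrative assumptions:
{code:java}
import org.apache.hadoop.yarn.api.records.NodeId;

// Hedged illustration only -- not the actual FiCaSchedulerApp code. It shows the
// shape of the YARN-8127 check as described by the quoted DEBUG message: a proposal
// that allocates from a reserved container is rejected when the proposed node is
// not the node holding the reservation.
public class ReservedAllocationCheck {
  /**
   * @param reservedNode node holding the reservation (h1 in the repro)
   * @param proposedNode node the scheduler wants to allocate on (h2 after app2 is killed)
   * @return true if the allocation proposal should be accepted
   */
  public boolean accept(NodeId reservedNode, NodeId proposedNode) {
    if (reservedNode != null && !reservedNode.equals(proposedNode)) {
      // Corresponds to "Try to allocate from reserved container ..., but node is not reserved"
      return false;
    }
    return true;
  }
}
{code}
In the repro above, the reservation sits on h1 while the scheduler proposes h2, so the proposal for app3's AM is rejected even though h2 now has free space, and the AM never leaves the RESERVED state.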
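
And as a rough sketch of the log-level change suggested in the comment above (assuming the surrounding RegularContainerAllocator code and an SLF4J-style logger; this is not the actual patch):
{code:java}
// Hedged sketch of the suggested change, not the actual patch. LOG, node and
// pendingAsk come from the surrounding allocator code quoted above. The
// parameterized form avoids building the message string unless DEBUG is enabled.
LOG.debug("Node : {} does not have sufficient resource for ask : {}"
    + " node total capability : {}",
    node.getNodeID(), pendingAsk, node.getTotalResource());
{code}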