[ https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17105675#comment-17105675 ]
Wangda Tan commented on YARN-10259:
-----------------------------------

Reviewed the patch, it looks good to me. I think it may introduce a performance regression for large clusters, but I agree this is the right fix; otherwise we can see issues such as the scheduler getting stuck. Can we move this (and similar logs) to debug:
{code:java}
LOG.warn("Node : " + node.getNodeID()
    + " does not have sufficient resource for ask : " + pendingAsk
    + " node total capability : " + node.getTotalResource());
{code}
For a heterogeneous cluster we can see this message quite often, so logging it at warn level is overkill to me.

So +1 to the patch; please move some logs to debug so that the log volume does not increase too much after this change. (A sketch of the debug-level form is appended at the end of this message.)

> Reserved Containers not allocated from available space of other nodes in
> CandidateNodeSet in MultiNodePlacement
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10259
>                 URL: https://issues.apache.org/jira/browse/YARN-10259
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.2.0, 3.3.0
>            Reporter: Prabhu Joseph
>            Assignee: Prabhu Joseph
>            Priority: Major
>         Attachments: YARN-10259-001.patch, YARN-10259-002.patch
>
>
> Reserved Containers are not allocated from the available space of other nodes
> in CandidateNodeSet in MultiNodePlacement.
>
> *Repro:*
> 1. MultiNode Placement Enabled.
> 2. Two nodes h1 and h2 with 8GB each.
> 3. Submit app1 AM (5GB) which gets placed in h1 and app2 AM (5GB) which gets placed in h2.
> 4. Submit app3 AM, which gets reserved on h1.
> 5. Kill app2, which frees space in h2.
> 6. app3 AM never gets ALLOCATED.
>
> RM logs show the YARN-8127 fix rejecting the allocation proposal for app3 AM on h2, as it expects the assignment to be on the same node where the reservation has happened.
> {code}
> 2020-05-05 18:49:37,264 DEBUG [AsyncDispatcher event handler]
> scheduler.SchedulerApplicationAttempt
> (SchedulerApplicationAttempt.java:commonReserve(573)) - Application attempt
> appattempt_1588684773609_0003_000001 reserved container
> container_1588684773609_0003_01_000001 on node host: h1:1234 #containers=1
> available=<memory:3072, vCores:7> used=<memory:5120, vCores:1>. This attempt
> currently has 1 reserved containers at priority 0; currentReservation
> <memory:5120, vCores:1>
>
> 2020-05-05 18:49:37,264 INFO [AsyncDispatcher event handler]
> fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(670)) - Reserved
> container=container_1588684773609_0003_01_000001, on node=host: h1:1234
> #containers=1 available=<memory:3072, vCores:7> used=<memory:5120, vCores:1>
> with resource=<memory:5120, vCores:1>
> RESERVED=[(Application=appattempt_1588684773609_0003_000001;
> Node=h1:1234; Resource=<memory:5120, vCores:1>)]
>
> 2020-05-05 18:49:38,283 DEBUG [Time-limited test]
> allocator.RegularContainerAllocator
> (RegularContainerAllocator.java:assignContainer(514)) - assignContainers:
> node=h2 application=application_1588684773609_0003 priority=0
> pendingAsk=<per-allocation-resource=<memory:5120, vCores:1>,repeat=1>
> type=OFF_SWITCH
>
> 2020-05-05 18:49:38,285 DEBUG [Time-limited test] fica.FiCaSchedulerApp
> (FiCaSchedulerApp.java:commonCheckContainerAllocation(371)) - Try to allocate
> from reserved container container_1588684773609_0003_01_000001, but node is
> not reserved
> ALLOCATED=[(Application=appattempt_1588684773609_0003_000001;
> Node=h2:1234; Resource=<memory:5120, vCores:1>)]
> {code}
> After reverting the fix of YARN-8127, it works.
> Attached testcase which reproduces the issue.
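
For context, here is a hedged, simplified illustration of the kind of check described by the quoted "but node is not reserved" DEBUG message above. This is not the actual FiCaSchedulerApp code; the class, method, and parameter names are illustrative assumptions:
{code:java}
import org.apache.hadoop.yarn.api.records.NodeId;

// Hedged illustration only -- not the actual FiCaSchedulerApp code. It shows the
// shape of the YARN-8127 check as described by the quoted DEBUG message: a proposal
// that allocates from a reserved container is rejected when the proposed node is
// not the node holding the reservation.
public class ReservedAllocationCheck {
  /**
   * @param reservedNode node holding the reservation (h1 in the repro)
   * @param proposedNode node the scheduler wants to allocate on (h2 after app2 is killed)
   * @return true if the allocation proposal should be accepted
   */
  public boolean accept(NodeId reservedNode, NodeId proposedNode) {
    if (reservedNode != null && !reservedNode.equals(proposedNode)) {
      // Corresponds to "Try to allocate from reserved container ..., but node is not reserved"
      return false;
    }
    return true;
  }
}
{code}
In the repro above, the reservation sits on h1 while the scheduler proposes h2, so the proposal for app3's AM is rejected even though h2 now has free space, and the AM never leaves the RESERVED state.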
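
And as a rough sketch of the log-level change suggested in the comment above (assuming the surrounding RegularContainerAllocator code and an SLF4J-style logger; this is not the actual patch):
{code:java}
// Hedged sketch of the suggested change, not the actual patch. LOG, node and
// pendingAsk come from the surrounding allocator code quoted above. The
// parameterized form avoids building the message string unless DEBUG is enabled.
LOG.debug("Node : {} does not have sufficient resource for ask : {}"
    + " node total capability : {}",
    node.getNodeID(), pendingAsk, node.getTotalResource());
{code}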