[ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603308#comment-14603308 ]
Rohith Sharma K S commented on YARN-3849:
-----------------------------------------

Below is the log trace for the issue. In our cluster there are 3 NodeManagers, each with resource {{<memory:327680, vCores:35>}}. The total cluster resource is {{clusterResource: <memory:983040, vCores:105>}}, and the CapacityScheduler is configured with two queues, *default* and *QueueA*.
# Application app-1 is submitted to queue default and starts running with 10 containers, each with {{resource: <memory:1024, vCores:10>}}, so the total used is {{usedResources=<memory:10240, vCores:91>}}
{noformat}
default user=spark used=<memory:10240, vCores:91> numContainers=10 headroom = <memory:1024, vCores:10> user-resources=<memory:10240, vCores:91>
Re-sorting assigned queue: root.default stats: default: capacity=0.5, absoluteCapacity=0.5, usedResources=<memory:10240, vCores:91>, usedCapacity=1.7333333, absoluteUsedCapacity=0.8666667, numApps=1, numContainers=10
{noformat}
*NOTE: resource allocation here is CPU DOMINANT.* After the 10 containers are running, the available NodeManager resources are
{noformat}
linux-174, available: <memory:323584, vCores:4>
linux-175, available: <memory:324608, vCores:5>
linux-223, available: <memory:324608, vCores:5>
{noformat}
# Application app-2 is submitted to QueueA. Its ApplicationMaster container starts running, after which NodeManager linux-174 has {{available: <memory:322560, vCores:3>}}
{noformat}
Assigned container container_1435072598099_0002_01_000001 of capacity <memory:1024, vCores:1> on host linux-174:26009, which has 5 containers, <memory:5120, vCores:32> used and <memory:322560, vCores:3> available after allocation | SchedulerNode.java:154
linux-174, available: <memory:322560, vCores:3>
{noformat}
# The preemption policy then does the calculation below
{noformat}
2015-06-23 23:20:51,127 NAME: QueueA CUR: <memory:0, vCores:0> PEN: <memory:0, vCores:0> GAR: <memory:491520, vCores:52> NORM: NaN IDEAL_ASSIGNED: <memory:0, vCores:0> IDEAL_PREEMPT: <memory:0, vCores:0> ACTUAL_PREEMPT: <memory:0, vCores:0> UNTOUCHABLE: <memory:0, vCores:0> PREEMPTABLE: <memory:0, vCores:0>
2015-06-23 23:20:51,128 NAME: default CUR: <memory:851968, vCores:91> PEN: <memory:0, vCores:0> GAR: <memory:491520, vCores:52> NORM: 1.0 IDEAL_ASSIGNED: <memory:851968, vCores:91> IDEAL_PREEMPT: <memory:0, vCores:0> ACTUAL_PREEMPT: <memory:0, vCores:0> UNTOUCHABLE: <memory:0, vCores:0> PREEMPTABLE: <memory:360448, vCores:39>
{noformat}
In the above log, observe that for queue default *CUR is <memory:851968, vCores:91>*, whereas the actual usage is *usedResources=<memory:10240, vCores:91>*: only the CPU dimension matches, not MEMORY. CUR is computed with the formula below (see the sketch right after this walkthrough):
#* CUR = {{clusterResource: <memory:983040, vCores:105>}} * {{absoluteUsedCapacity(0.8666667)}} = {{<memory:851968, vCores:91>}}
#* GAR = {{clusterResource: <memory:983040, vCores:105>}} * {{absoluteCapacity(0.5)}} = {{<memory:491520, vCores:52>}}
#* PREEMPTABLE = CUR - GAR = {{<memory:360448, vCores:39>}}
# App-2 requests containers with {{resource: <memory:1024, vCores:10>}}.
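Before continuing to the preemption computation, here is a minimal, self-contained sketch of the CUR mismatch from step 3. It is illustrative plain Java only (the {{Res}} class and class name are stand-ins, not the real {{Resource}}/{{ResourceCalculator}} API), using the numbers from the log above: applying the single scalar {{absoluteUsedCapacity}} fraction to both dimensions reproduces the dominant vCores value but wildly inflates memory.
{code:java}
// Illustrative sketch only -- not the actual ProportionalCapacityPreemptionPolicy code.
// Shows why CUR derived from a single scalar fraction (absoluteUsedCapacity) cannot
// match the real usedResources when the DominantResourceCalculator picked CPU as dominant.
public class CurFromFractionDemo {

  /** Minimal stand-in for a <memory, vCores> resource. */
  static final class Res {
    final long memoryMb;
    final int vcores;
    Res(long memoryMb, int vcores) { this.memoryMb = memoryMb; this.vcores = vcores; }
    @Override public String toString() { return "<memory:" + memoryMb + ", vCores:" + vcores + ">"; }
  }

  public static void main(String[] args) {
    Res cluster = new Res(983040, 105);  // total cluster resource from the log
    Res used    = new Res(10240, 91);    // actual usedResources of queue "default"

    // absoluteUsedCapacity is a single float; under DRC it reflects only the
    // dominant dimension (vCores here): 91 / 105 = 0.8666667
    double absUsedCap = Math.max((double) used.memoryMb / cluster.memoryMb,
                                 (double) used.vcores   / cluster.vcores);

    // What the policy effectively computes as CUR: the same fraction applied to BOTH dimensions.
    Res cur = new Res(Math.round(cluster.memoryMb * absUsedCap),
                      (int) Math.round(cluster.vcores * absUsedCap));

    System.out.println("actual used = " + used); // <memory:10240, vCores:91>
    System.out.println("derived CUR = " + cur);  // <memory:851968, vCores:91>
    // vCores match because CPU is dominant, but memory is inflated by roughly 83x.
  }
}
{code}
Running it prints {{derived CUR = <memory:851968, vCores:91>}} against an actual usage of {{<memory:10240, vCores:91>}}.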
The preemption cycle then computes how much should be preempted:
{noformat}
2015-06-23 23:21:03,131 | DEBUG | SchedulingMonitor (ProportionalCapacityPreemptionPolicy) | 1435072863131: NAME: default CUR: <memory:851968, vCores:91> PEN: <memory:0, vCores:0> GAR: <memory:491520, vCores:52> NORM: NaN IDEAL_ASSIGNED: <memory:491520, vCores:52> IDEAL_PREEMPT: <memory:97043, vCores:10> ACTUAL_PREEMPT: <memory:0, vCores:0> UNTOUCHABLE: <memory:0, vCores:0> PREEMPTABLE: <memory:360448, vCores:39>
{noformat}
Observe that *IDEAL_PREEMPT is <memory:97043, vCores:10>*: app-2 in QueueA only needs 10 vCores to be preempted, yet 97043 MB of memory is also marked for preemption even though memory is sufficiently available. IDEAL_PREEMPT comes from the following calculation:
* totalPreemptionAllowed = clusterResource: <memory:983040, vCores:105> * 0.1 = <memory:98304, vCores:10.5>
* totPreemptionNeeded = CUR - IDEAL_ASSIGNED = CUR: <memory:851968, vCores:91>
* scalingFactor = Resources.divide(drc, <memory:491520, vCores:52>, <memory:98304, vCores:10.5>, <memory:851968, vCores:91>) = 0.114285715
* toBePreempted = CUR: <memory:851968, vCores:91> * scalingFactor (0.1139045128455529) = <memory:97368, vCores:10>, giving {{resource-to-obtain = <memory:97043, vCores:10>}}

*So the problem is in one (or both) of the steps below:*
# As [~sunilg] said, usedResources=<memory:10240, vCores:91>, but the preemption policy wrongly computes the current used capacity as {{<memory:851968, vCores:91>}}. This is mainly because the policy uses absoluteUsedCapacity (a single fraction of the cluster resource) for the current usage, which always gives a wrong result for one of the dimensions when the DominantResourceCalculator is used. I think a fraction should not be used at all here, since it breaks down for DRC (multi-dimensional resources); instead we should use the usedResources from the CSQueue.
# Even leaving step 1 aside, toBePreempted is calculated as resource-to-obtain: <memory:97043, vCores:10>. When a container is marked for preemption, the policy subtracts that container's resources from resource-to-obtain, so after the first marked container it becomes *<memory:96019, vCores:0>* (each container is <1gb, 10cores>). From the next container onwards MEMORY has become the DOMINANT resource, and the policy keeps marking containers to satisfy the remaining ~96000 MB of memory even though the CPU demand is already fulfilled (see the sketch below). The dominant resource flips: the scheduler allocated these containers CPU-dominant, but the preemption policy proceeds MEMORY-dominant, and that is what kills all the NON-AM containers.

*And the problem is not just that all NON-AM containers get killed; it continues in a loop :-( Once app-2 starts running containers in QueueA and app-1 asks for containers again, the preemption policy again kills all the NON-AM containers. This repeats forever, the two applications keep killing each other's tasks, and neither of them ever completes.*
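To illustrate step 2, here is another minimal sketch (again illustrative plain Java with a made-up class name, not the real container-selection code, using the numbers reported above): the very first preempted container already covers the entire vCores demand, but the inflated memory part of resource-to-obtain keeps the selection loop marking containers.
{code:java}
// Illustrative sketch only -- not the actual container-selection code.
// resource-to-obtain starts at <memory:97043, vCores:10>; every marked container
// (<memory:1024, vCores:10>) is subtracted from it. The CPU demand is satisfied
// after the first container, but the inflated memory demand keeps the loop going.
public class DominantFlipDemo {

  public static void main(String[] args) {
    long memToObtain   = 97043; // from IDEAL_PREEMPT / resource-to-obtain in the log
    int  coresToObtain = 10;

    final long containerMem   = 1024; // each app-1 container is <memory:1024, vCores:10>
    final int  containerCores = 10;

    int marked = 0;
    while (memToObtain > 0 || coresToObtain > 0) {
      memToObtain   -= containerMem;
      coresToObtain -= containerCores;
      marked++;
      if (marked == 1) {
        // After one container the vCores demand is already met, yet ~96000 MB of
        // (bogus) memory demand remains, so memory now drives the selection.
        System.out.printf("after 1 container: <memory:%d, vCores:%d>%n",
            Math.max(0, memToObtain), Math.max(0, coresToObtain));
      }
    }
    System.out.println("containers needed to satisfy resource-to-obtain: " + marked);
    // Prints 95 -- far more than the 10 vCores app-2 actually asked for, and far more
    // containers than app-1 even has, which is why all of its NON-AM containers die.
  }
}
{code}
It reports 95 containers, even though a single container already satisfies the 10 requested vCores.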
> Too much of preemption activity causing continuous killing of containers across queues
> ---------------------------------------------------------------------------------------
>
>                 Key: YARN-3849
>                 URL: https://issues.apache.org/jira/browse/YARN-3849
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.7.0
>            Reporter: Sunil G
>            Assignee: Sunil G
>            Priority: Critical
>
> Two queues are used, each given a capacity of 0.5. The Dominant Resource policy is used.
> 1. An app is submitted in QueueA and consumes the full cluster capacity.
> 2. After an app is submitted in QueueB, there is some demand and preemption is invoked in QueueA.
> 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed that all containers other than the AM are getting killed in QueueA.
> 4. Now the app in QueueB tries to take over the cluster with the current free space. But there is updated demand from the app in QueueA, which lost its containers earlier, and preemption now kicks in on QueueB.
> Steps 3 and 4 keep happening in a loop, so neither of the apps ever completes.