[ 
https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106507#comment-15106507
 ] 

Karam Singh commented on YARN-4606:
-----------------------------------

From offline discussion with [~wangda]:
After looking at the log and code, I think I understand what happened.
The root cause is that we shouldn't activate an application while it is in the 
pending state. This is not a new issue; at least branch-2.6 contains it.
Activating it increases #active-users in the queue, but the newly added active 
user cannot get resources (because its application is still pending), while the 
old user hits the user-limit (the newly added user lowers the user-limit).
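
To illustrate the effect, here is a minimal sketch of how a per-user limit can shrink as the active-user count grows. It is a simplified model, not the actual CapacityScheduler code: the class name, method name, and the 48-container figure are made up for illustration, and it ignores the UserLimitFactor cap and other details.

```java
// Hypothetical, simplified model of a per-user limit; not the real LeafQueue logic.
final class UserLimitSketch {

    /**
     * @param capacityBasis    containers the limit is computed against (assumed value)
     * @param userLimitPercent minimum share per user, in percent (UserLimit=25 in the report)
     * @param activeUsers      number of users the scheduler counts as active
     * @return the number of containers a single user may hold under this model
     */
    static int userLimit(int capacityBasis, int userLimitPercent, int activeUsers) {
        // Equal split among all users counted as active.
        int equalShare = capacityBasis / Math.max(activeUsers, 1);
        // Floor guaranteed by the user-limit percentage.
        int guaranteedShare = capacityBasis * userLimitPercent / 100;
        return Math.max(equalShare, guaranteedShare);
    }

    public static void main(String[] args) {
        // One user really running containers.
        System.out.println(userLimit(48, 25, 1));
        // Same queue, but three more users counted as active although
        // their applications are still pending and cannot use the share.
        System.out.println(userLimit(48, 25, 4));
    }
}
```

In this model, counting users whose applications are only pending lowers the limit for the user who is actually running, while the freed share goes to users who cannot take it yet.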


> Sometimes Fairness in conjunction with UserLimitPercent and UserLimitFactor 
> in queue leads to a situation where it appears that applications in the queue 
> are getting starved or stuck
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4606
>                 URL: https://issues.apache.org/jira/browse/YARN-4606
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler
>    Affects Versions: 2.8.0, 2.7.1
>            Reporter: Karam Singh
>
> Encountered while studying the behaviour of fairness with UserLimitPercent and 
> UserLimitFactor during the following test:
> Ran GridMix with queue settings: Capacity=10, MaxCap=80, UserLimit=25, 
> UserLimitFactor=32, FairOrderingPolicy only. Encountered an application 
> starvation situation where 33 applications were running (190 apps completed 
> out of 761 apps; the queue can run 345 containers) with a total of 45 
> containers running. The 12 extra containers all belonged to one app (that app 
> had around 18000 tasks); all other apps had only their AM running, and no 
> other containers were given to any of them. After that app finished, 32 AMs 
> kept running without any task containers being launched.
> GridMix was run with the following settings:
> gridmix.client.pending.queue.depth=10, gridmix.job-submission.policy=REPLAY, 
> gridmix.client.submit.threads=5, gridmix.submit.multiplier=0.0001, 
> gridmix.job.type=SLEEPJOB, mapreduce.framework.name=yarn, 
> mapreduce.job.queuename=hive1, mapred.job.queue.name=hive1, 
> gridmix.sleep.max-map-time=5000, gridmix.sleep.max-reduce-time=5000, 
> gridmix.user.resolve.class=org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver
> with a users file containing 4 users for RoundRobinUserResolver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
