[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106507#comment-15106507 ]
Karam Singh commented on YARN-4606:
-----------------------------------

From offline discussion with [~wangda]:

After looking at the logs and code, I think I understand what happened.

The root cause is that we shouldn't activate an application while it is in the pending state. This is not a new issue; at least branch-2.6 contains it. Activating a pending application increases #active-users in the queue, but the newly added active user cannot get resources (because its application is still pending), and the old user hits the user-limit (because the newly added user lowers the user-limit).

> Sometimes Fairness in conjunction with UserLimitPercent and UserLimitFactor
> in a queue leads to a situation where it appears that applications in the
> queue are getting starved or stuck
> ---------------------------------------------------------------------------
>
>                 Key: YARN-4606
>                 URL: https://issues.apache.org/jira/browse/YARN-4606
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler
>    Affects Versions: 2.8.0, 2.7.1
>            Reporter: Karam Singh
>
> Encountered while studying the behaviour of fairness with UserLimitPercent
> and UserLimitFactor during the following test:
> Ran GridMix with queue settings: Capacity=10, MaxCap=80, UserLimit=25,
> UserLimitFactor=32, FairOrderingPolicy only. Encountered an application
> starvation situation where 33 applications were running (190 apps completed
> out of 761 apps; the queue can hold 345 containers) with a total of 45
> containers running. The 12 extra containers all belonged to a single app
> (that app had around 18000 tasks); all other apps had only their AM running,
> and no other containers were given to any app.
> After that app finished, there were 32 AMs that kept running without any
> task containers being launched.
> GridMix was run with the following settings:
> gridmix.client.pending.queue.depth=10, gridmix.job-submission.policy=REPLAY,
> gridmix.client.submit.threads=5, gridmix.submit.multiplier=0.0001,
> gridmix.job.type=SLEEPJOB, mapreduce.framework.name=yarn,
> mapreduce.job.queuename=hive1, mapred.job.queue.name=hive1,
> gridmix.sleep.max-map-time=5000, gridmix.sleep.max-reduce-time=5000,
> gridmix.user.resolve.class=org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver
> with a users file containing 4 users for RoundRobinUserResolver
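
To illustrate the root cause described in the comment at the top of this message, here is a rough sketch of how counting one more active user can lower everyone's user-limit. It is a deliberately simplified model, not the actual CapacityScheduler code: the function name simplified_user_limit, the 120-container queue size, and the 50 containers held by user A are all hypothetical. The model assumes the per-user limit is the larger of an equal share among active users and the minimum-user-limit-percent share, capped by queue capacity times the user-limit-factor.

```python
def simplified_user_limit(current_capacity, active_users,
                          min_user_limit_percent, queue_capacity,
                          user_limit_factor):
    """Simplified per-user limit in containers (hypothetical model)."""
    # Equal split of the capacity currently under consideration.
    equal_share = current_capacity / active_users
    # Floor derived from the minimum user limit percent (UserLimit=25).
    floor = current_capacity * min_user_limit_percent / 100.0
    # Upper bound derived from the user limit factor (UserLimitFactor=32).
    cap = queue_capacity * user_limit_factor
    return min(max(equal_share, floor), cap)


# Hypothetical numbers: 120 containers in play, user A already holds 50.
user_a_usage = 50

# Two users really have runnable applications.
limit_before = simplified_user_limit(120, 2, 25, 120, 32)

# A third user is counted as active although its application is still
# pending and cannot receive containers.
limit_after = simplified_user_limit(120, 3, 25, 120, 32)

can_grow_before = user_a_usage < limit_before
can_grow_after = user_a_usage < limit_after

print(limit_before, can_grow_before)  # 60.0 True
print(limit_after, can_grow_after)    # 40.0 False
```

In this sketch user A is blocked by the lowered limit while the newly counted user cannot use its share, which matches the stuck state described above.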