[ https://issues.apache.org/jira/browse/YARN-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wilfred Spiegelenburg reassigned YARN-10112:
--------------------------------------------

    Assignee: Wilfred Spiegelenburg

> Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used with Fair Scheduler size based weights enabled
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10112
>                 URL: https://issues.apache.org/jira/browse/YARN-10112
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.8.5
>            Reporter: Yu Wang
>            Assignee: Wilfred Spiegelenburg
>            Priority: Minor
>
> The user runs the FairScheduler with yarn.scheduler.fair.sizebasedweight
> set to true. In the JStack thread dump that the support engineers attached
> to the ticket, the method getAppWeight of the FairScheduler class holds the
> FairScheduler object monitor the entire time, so
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate
> waits indefinitely to enter the same object monitor, resulting in the
> livelock.
>
> The issue occurs very infrequently, and we have not yet found a way to
> reproduce it consistently. It resembles what YARN-1458 reports, but that
> code fix appears to have been in effect since 2.6.
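The contention described in the report can be illustrated with a minimal, self-contained Java sketch. It is not the FairScheduler code: `MonitorContentionSketch`, `SchedulerSketch`, and the two latches are hypothetical names introduced for illustration only. The weight formula (`log1p` of the memory demand divided by `log(2)`) is an assumption based on the `Math.log1p` frame in the thread dump. One thread holds the object monitor inside a synchronized `update()` while a second thread tries to enter a synchronized `nodeUpdate()` on the same object.

```java
import java.util.concurrent.CountDownLatch;

public class MonitorContentionSketch {

    // Hypothetical stand-in for the scheduler; NOT the real FairScheduler.
    static class SchedulerSketch {
        private final boolean sizeBasedWeight;

        SchedulerSketch(boolean sizeBasedWeight) {
            this.sizeBasedWeight = sizeBasedWeight;
        }

        // Assumed size-based weight: log2(1 + memory demand), else 1.0.
        // Synchronized, so it re-enters the monitor already held by update().
        synchronized double getAppWeight(long memoryDemandMb) {
            double weight = 1.0;
            if (sizeBasedWeight) {
                weight = Math.log1p(memoryDemandMb) / Math.log(2);
            }
            return weight;
        }

        // Holds the monitor for the whole pass. The latches exist only so the
        // demo can keep the monitor held for a controlled amount of time.
        synchronized double update(long[] demands, CountDownLatch entered,
                                   CountDownLatch release) throws InterruptedException {
            entered.countDown();   // signal: monitor is now held
            release.await();       // keep holding it until the demo says so
            double total = 0;
            for (long demand : demands) {
                total += getAppWeight(demand);
            }
            return total;
        }

        // Needs the same monitor, like the blocked frame in the thread dump.
        synchronized void nodeUpdate() {
            // Intentionally empty: only lock acquisition matters here.
        }
    }

    // Returns the state observed for the thread calling nodeUpdate() while
    // another thread is still inside update().
    static Thread.State observeNodeUpdateState() throws InterruptedException {
        SchedulerSketch scheduler = new SchedulerSketch(true);
        CountDownLatch entered = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread updater = new Thread(() -> {
            try {
                scheduler.update(new long[] {1023, 4095}, entered, release);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "UpdateThreadSketch");
        Thread events = new Thread(scheduler::nodeUpdate, "EventProcessorSketch");

        updater.start();
        entered.await();           // the updater now owns the monitor
        events.start();

        // Poll (bounded to 5 seconds) until the second thread reaches the monitor.
        Thread.State seen = events.getState();
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (seen != Thread.State.BLOCKED && System.nanoTime() < deadline) {
            Thread.sleep(1);
            seen = events.getState();
        }

        release.countDown();       // let update() finish and free the monitor
        updater.join();
        events.join();
        return seen;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("nodeUpdate thread state: " + observeNodeUpdateState());
    }
}
```

In this sketch the second thread stays blocked only for as long as the first one keeps the monitor. If the computation under the lock never finished, the blocked thread would never proceed, which matches the symptom in the report. The sketch does not explain why the real update pass failed to terminate.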
>
>
> {code:java}
> "ResourceManager Event Processor" #17 prio=5 os_prio=0 tid=0x00007fbcee65e800 nid=0x2ea4 waiting for monitor entry [0x00007fbcbcd5e000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:1105)
>         - waiting to lock <0x00000006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1362)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:129)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:801)
>         at java.lang.Thread.run(Thread.java:748)
>
> "FairSchedulerUpdateThread" #23 daemon prio=5 os_prio=0 tid=0x00007fbceea0e800 nid=0x2ea2 runnable [0x00007fbcbcf60000]
>    java.lang.Thread.State: RUNNABLE
>         at java.lang.StrictMath.log1p(Native Method)
>         at java.lang.Math.log1p(Math.java:1747)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:570)
>         - locked <0x00000006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getWeights(FSAppAttempt.java:953)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:192)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:180)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:51)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:138)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:235)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:89)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:365)
>         - locked <0x00000006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:314)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org