[ 
https://issues.apache.org/jira/browse/YARN-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026740#comment-17026740
 ] 

Yu Wang commented on YARN-10112:
--------------------------------

Thank you, Wilfred, for pointing out the resolution!

> Livelock (Runnable FairScheduler.getAppWeight) in Resource Manager when used 
> with Fair Scheduler size based weights enabled
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10112
>                 URL: https://issues.apache.org/jira/browse/YARN-10112
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.8.5
>            Reporter: Yu Wang
>            Assignee: Wilfred Spiegelenburg
>            Priority: Minor
>
> The user runs the FairScheduler with yarn.scheduler.fair.sizebasedweight set 
> to true. In the JStack thread dump from the support ticket, the 
> FairScheduler.getAppWeight method (below) holds the FairScheduler object 
> monitor continuously. As a result, 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate
> waits indefinitely to enter the same object monitor, resulting in the 
> livelock.
>  
> The issue occurs very infrequently, and we have not yet found a way to 
> reproduce it consistently. It resembles the issue reported in YARN-1458, 
> but the fix for that issue appears to have been in effect since 2.6.
>  
>  
> {code:java}
> "ResourceManager Event Processor" #17 prio=5 os_prio=0 tid=0x00007fbcee65e800 nid=0x2ea4 waiting for monitor entry [0x00007fbcbcd5e000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:1105)
>         - waiting to lock <0x00000006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1362)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:129)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:801)
>         at java.lang.Thread.run(Thread.java:748)
>
> "FairSchedulerUpdateThread" #23 daemon prio=5 os_prio=0 tid=0x00007fbceea0e800 nid=0x2ea2 runnable [0x00007fbcbcf60000]
>    java.lang.Thread.State: RUNNABLE
>         at java.lang.StrictMath.log1p(Native Method)
>         at java.lang.Math.log1p(Math.java:1747)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:570)
>         - locked <0x00000006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getWeights(FSAppAttempt.java:953)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:192)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:180)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:51)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:138)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:235)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:89)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:365)
>         - locked <0x00000006eb816b18> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:314)
> {code}
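
The blocking pattern visible in the thread dump above can be reproduced in miniature. The following is only an illustrative sketch, not Hadoop code: the class MonitorContentionDemo, its methods update and nodeUpdate, and the schedulerLock field are hypothetical stand-ins. It assumes nothing about why the real computation failed to finish; it only shows that while one thread stays RUNNABLE inside a synchronized block, a second thread that needs the same monitor is reported as BLOCKED.

```java
import java.util.concurrent.CountDownLatch;

public class MonitorContentionDemo {
    // Hypothetical stand-in for the scheduler object whose monitor both threads need.
    private final Object schedulerLock = new Object();
    // Cleared by the observer so that the demo can terminate.
    private volatile boolean keepComputing = true;

    /** Stand-in for an update pass: holds the monitor while a computation keeps running. */
    private void update(CountDownLatch lockHeld) {
        synchronized (schedulerLock) {
            lockHeld.countDown(); // signal that the monitor is now held
            double sink = 0;
            while (keepComputing) {
                sink += Math.log1p(sink + 1.0); // busy work; the thread stays RUNNABLE
            }
        }
    }

    /** Stand-in for an event handler that needs the same monitor. */
    private void nodeUpdate() {
        synchronized (schedulerLock) {
            // Real work would happen here once the monitor is acquired.
        }
    }

    /** Starts both threads and returns the waiter's state observed while the monitor is held. */
    public static Thread.State observeWaiterState() throws InterruptedException {
        MonitorContentionDemo demo = new MonitorContentionDemo();
        CountDownLatch lockHeld = new CountDownLatch(1);

        Thread updater = new Thread(() -> demo.update(lockHeld), "UpdateThread");
        updater.setDaemon(true);
        updater.start();
        lockHeld.await(); // make sure the updater owns the monitor first

        Thread waiter = new Thread(demo::nodeUpdate, "EventProcessor");
        waiter.setDaemon(true);
        waiter.start();

        // Poll for up to five seconds until the waiter is parked on the monitor.
        Thread.State state = waiter.getState();
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (state != Thread.State.BLOCKED && System.nanoTime() < deadline) {
            Thread.sleep(10);
            state = waiter.getState();
        }

        demo.keepComputing = false; // let the updater release the monitor
        updater.join();
        waiter.join();
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Waiter state while the monitor was held: " + observeWaiterState());
    }
}
```

In this sketch the waiter only proceeds after keepComputing is cleared. If the loop inside the synchronized block never ended, the waiter would stay BLOCKED indefinitely, which matches the "waiting to lock" entry in the dump.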



