[ https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674614#comment-16674614 ]
Wilfred Spiegelenburg commented on YARN-7560:
---------------------------------------------

[~zhengchenyu] do you mind if I take over to get this finalised and checked in?

> Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-7560
>                 URL: https://issues.apache.org/jira/browse/YARN-7560
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler, resourcemanager
>    Affects Versions: 3.0.0
>            Reporter: zhengchenyu
>            Assignee: zhengchenyu
>            Priority: Major
>         Attachments: YARN-7560.000.patch, YARN-7560.001.patch
>
>
> In our cluster, we changed the configuration and ran refreshQueues, after which the ResourceManager hung. The ResourceManager also could not restart successfully. Every jstack dump we took showed the same trace:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x00007f98e8017000 nid=0x2f5 runnable [0x00007f98eed9a000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
>         - locked <0x00007f8c4a8177a0> (a java.util.HashMap)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
>         - locked <0x00007f8c4a7eb2e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         - locked <0x00007f8c4a76ac48> (a java.lang.Object)
>         at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         - locked <0x00007f8c49254268> (a java.lang.Object)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         - locked <0x00007f8c467495e0> (a java.lang.Object)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
> {code}
> When we debugged the cluster, we found that resourceUsedWithWeightToResourceRatio returns a negative value, so the loop never exits.
> In our cluster, the sum of all minRes values exceeds Integer.MAX_VALUE, so resourceUsedWithWeightToResourceRatio returns a negative value.
> Below is the loop. totalResource is a long and is always positive, but resourceUsedWithWeightToResourceRatio returns an int. Our cluster is large enough that the int result overflows and wraps to a negative number, so the loop condition stays true and the loop never breaks.
> {code}
> while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
>     < totalResource) {
>   rMax *= 2.0;
> }
> {code}
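To illustrate the overflow described in the report, here is a minimal, standalone Java sketch. It is not the YARN code and not the attached patch: the class `OverflowSketch`, the methods `sumAsInt` and `sumCapped`, and the sample values are all hypothetical. It only shows how an int accumulator wraps to a negative number and compares as "less than" a positive long forever, and how accumulating in a long and capping the result avoids that.

```java
class OverflowSketch {
    // Hypothetical stand-in for an int accumulator: the sum silently wraps
    // past Integer.MAX_VALUE and becomes negative.
    static int sumAsInt(int[] shares) {
        int total = 0;
        for (int s : shares) {
            total += s; // int arithmetic wraps on overflow, no exception
        }
        return total;
    }

    // One possible guard (an assumption, not the actual fix): accumulate in
    // a long and cap the result at Integer.MAX_VALUE before narrowing.
    static int sumCapped(int[] shares) {
        long total = 0L;
        for (int s : shares) {
            total += s;
        }
        return (int) Math.min(total, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        int[] shares = {Integer.MAX_VALUE, 1}; // sum exceeds the int range
        long totalResource = 1_000L;           // positive long, as in the report

        // Wrapped sum is negative, so a "sum < totalResource" loop condition stays true.
        System.out.println(sumAsInt(shares) < totalResource);

        // Capped sum is Integer.MAX_VALUE, so the same condition becomes false.
        System.out.println(sumCapped(shares) < totalResource);
    }
}
```

In a loop shaped like the one quoted above, doubling rMax cannot help once the int sum has wrapped, because a larger rMax only pushes the true sum further past the int range.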