[ https://issues.apache.org/jira/browse/YARN-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198511#comment-17198511 ]
Dongwook Kwon commented on YARN-10428:
--------------------------------------

Thanks Guang Yang for the answer.

>> Hi Wenning Ding: as far as I know, mag is supposed to be non-negative unless
>> there's a bug.

If so, wouldn't a change like this be better, to clarify the intention?
{code:java}
double mag = r.getSchedulingResourceUsage().getCachedUsed(CommonNodeLabelsManager.ANY).getMemorySize();
if (sizeBasedWeight && mag > 0) {
  double weight = Math.log1p(r.getSchedulingResourceUsage().getCachedDemand(
      CommonNodeLabelsManager.ANY).getMemorySize()) / Math.log(2);
  mag = mag / weight;
}
return Math.max(mag, 0);{code}
Also, the current change works only under the assumption that getCachedUsed and getCachedDemand both return zero at the same time. Otherwise, if getCachedUsed returns non-zero but getCachedDemand returns zero, mag is divided by zero, because Math.log1p(0) / Math.log(2) is zero. In Java double arithmetic that yields Infinity for a non-zero mag, and NaN only when mag is zero as well. I think it's better to check these preconditions in this method and return a value accordingly than to just assume.

> Zombie applications in the YARN queue using FAIR + sizebasedweight
> ------------------------------------------------------------------
>
>                 Key: YARN-10428
>                 URL: https://issues.apache.org/jira/browse/YARN-10428
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.8.5
>            Reporter: Guang Yang
>            Priority: Major
>         Attachments: YARN-10428.001.patch, YARN-10428.002.patch
>
>
> Seeing zombie jobs in the YARN queue that uses FAIR and size based weight
> ordering policy.
> *Detection:*
> The YARN UI shows an incorrect number of "Num Schedulable Applications".
> *Impact:*
> The queue has an upper limit on the number of running applications. With
> zombie jobs, it hits the limit even though the number of running applications
> is far less than the limit.
> *Workaround:*
> Fail over and restart the Resource Manager process.
> *Analysis:*
> In the heap dump, we can find the zombie jobs in the
> `FairOrderingPolicy#schedulableEntities` set (see attachment). Take application
> "application_1599157165858_29429" for example: it is still in the
> `FairOrderingPolicy#schedulableEntities` set; however, if we check the
> resource manager log, we can see that the RM already tried to remove the
> application:
>
> ./yarn-yarn-resourcemanager-ip-172-21-153-252.log.2020-09-04-04:2020-09-04
> 04:32:19,730 INFO
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue
> (ResourceManager Event Processor): Application removed - appId:
> application_1599157165858_29429 user: svc_di_data_eng queue: core-data
> #user-pending-applications: -3 #user-active-applications: 7
> #queue-pending-applications: 0 #queue-active-applications: 21
>
> So it appears the RM failed to remove the application from the set.
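
To make the division-by-zero concern in the comment concrete, here is a minimal stand-alone sketch of the arithmetic. It is not the scheduler code: the class `MagnitudeSketch`, the methods `unguarded` and `guarded`, and the plain `long` parameters are hypothetical stand-ins for the values returned by getCachedUsed and getCachedDemand, and falling back to the unweighted usage when the weight is zero is only one possible choice.

```java
// Hypothetical illustration only; names and fallback behaviour are assumptions,
// not taken from the YARN-10428 patches.
final class MagnitudeSketch {

    /** Unguarded form: used / log2(1 + demand), with no precondition checks. */
    static double unguarded(long usedMemory, long demandMemory) {
        double weight = Math.log1p(demandMemory) / Math.log(2);
        return usedMemory / weight; // weight is 0.0 when demandMemory is 0
    }

    /** Guarded form: checks both inputs before dividing. */
    static double guarded(long usedMemory, long demandMemory) {
        double mag = Math.max(usedMemory, 0L);
        if (mag == 0) {
            return 0; // nothing used, so no division is needed
        }
        double weight = Math.log1p(Math.max(demandMemory, 0L)) / Math.log(2);
        if (weight <= 0) {
            return mag; // assumption: fall back to the unweighted usage
        }
        return mag / weight;
    }

    public static void main(String[] args) {
        System.out.println(unguarded(0, 0));     // NaN (0.0 / 0.0)
        System.out.println(unguarded(1024, 0));  // Infinity (1024.0 / 0.0)
        System.out.println(guarded(0, 0));       // 0.0
        System.out.println(guarded(1024, 0));    // 1024.0
        System.out.println(guarded(1024, 1023)); // roughly 102.4
    }
}
```

Neither NaN nor Infinity is a usable magnitude for ordering: every `<`, `>` or `==` comparison involving NaN is false, so a value like that cannot be placed consistently in a sorted set.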