[jira] [Commented] (YARN-10428) Zombie applications in the YARN queue using FAIR + sizebasedweight

Dongwook Kwon (Jira) Fri, 18 Sep 2020 13:09:59 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198511#comment-17198511
 ]


Dongwook Kwon commented on YARN-10428:
--------------------------------------

Thanks Guang Yang for the answer.

>> Hi Wenning Ding: as far as I know, mag is supposed to be non-negative unless 
>> there's a bug.

If so, wouldn't a change like this be better, to make the intention explicit?
{code:java}
// Magnitude starts as the memory currently used under the ANY label.
double mag = r.getSchedulingResourceUsage()
    .getCachedUsed(CommonNodeLabelsManager.ANY).getMemorySize();
// Only scale by the demand-based weight when there is usage to scale.
if (sizeBasedWeight && mag > 0) {
  double weight = Math.log1p(r.getSchedulingResourceUsage()
      .getCachedDemand(CommonNodeLabelsManager.ANY).getMemorySize()) / Math.log(2);
  mag = mag / weight;
}
return Math.max(mag, 0);{code}
 

Also, the current change works only under the assumption that getCachedUsed and 
getCachedDemand return zero at the same time. If getCachedUsed returns non-zero 
but getCachedDemand returns zero, mag will not be a usable value: 
Math.log1p(0) / Math.log(2) is zero, so mag is divided by zero. In Java double 
arithmetic that gives Infinity for a non-zero mag; NaN arises in the 0 / 0 case, 
when both are zero.

I think it's better to check these preconditions in this method and return a 
value accordingly than to just assume them.
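
To make the edge cases concrete, here is a minimal standalone sketch. It is 
not the patch and not the actual FairOrderingPolicy code: the class name, the 
method names, and the plain long parameters `used` and `demand` (standing in 
for getCachedUsed(...).getMemorySize() and getCachedDemand(...).getMemorySize()) 
are all hypothetical. Returning the unscaled usage when the weight is zero is 
also only one possible choice, assumed here for illustration.

```java
final class MagnitudeSketch {

    // Unguarded division, to show what double arithmetic does at the edges.
    static double unguarded(long used, long demand) {
        double mag = used;
        double weight = Math.log1p(demand) / Math.log(2); // 0.0 when demand == 0
        return mag / weight;
    }

    // Guarded variant: checks both preconditions instead of assuming them.
    static double guarded(long used, long demand, boolean sizeBasedWeight) {
        double mag = Math.max(used, 0);
        if (sizeBasedWeight && mag > 0) {
            double weight = Math.log1p(Math.max(demand, 0)) / Math.log(2);
            // Assumption: with zero demand, fall back to the unscaled usage.
            if (weight > 0) {
                mag = mag / weight;
            }
        }
        return mag;
    }

    public static void main(String[] args) {
        System.out.println(unguarded(1024, 0));     // Infinity
        System.out.println(unguarded(0, 0));        // NaN
        System.out.println(guarded(1024, 0, true)); // 1024.0
        System.out.println(guarded(0, 0, true));    // 0.0
    }
}
```

Note that Math.max(mag, 0) alone would not catch the NaN case, since 
Math.max returns NaN when either argument is NaN.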

> Zombie applications in the YARN queue using FAIR + sizebasedweight
> ------------------------------------------------------------------
>
>                 Key: YARN-10428
>                 URL: https://issues.apache.org/jira/browse/YARN-10428
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.8.5
>            Reporter: Guang Yang
>            Priority: Major
>         Attachments: YARN-10428.001.patch, YARN-10428.002.patch
>
>
> Seeing zombie jobs in a YARN queue that uses the FAIR ordering policy with 
> size-based weight.
> *Detection:*
> The YARN UI shows an incorrect number of "Num Schedulable Applications".
> *Impact:*
> The queue has an upper limit on the number of running applications. With 
> zombie jobs, it hits the limit even though the number of running applications 
> is far below the limit.
> *Workaround:*
> Fail over and restart the Resource Manager process.
> *Analysis:*
> In the heap dump, we can find the zombie jobs in 
> `FairOrderingPolicy#schedulableEntities` (see attachment). Take application 
> "application_1599157165858_29429" for example: it is still in the 
> `FairOrderingPolicy#schedulableEntities` set. However, if we check the 
> resource manager log, we can see that the RM already tried to remove the 
> application:
>  
> ./yarn-yarn-resourcemanager-ip-172-21-153-252.log.2020-09-04-04:2020-09-04 
> 04:32:19,730 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue 
> (ResourceManager Event Processor): Application removed - appId: 
> application_1599157165858_29429 user: svc_di_data_eng queue: core-data 
> #user-pending-applications: -3 #user-active-applications: 7 
> #queue-pending-applications: 0 #queue-active-applications: 21
>  
> So it appears the RM failed to remove the application from the set.



