[jira] [Reopened] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node

Wilfred Spiegelenburg (JIRA) Wed, 23 Aug 2017 22:43:45 -0700

     [ 
https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wilfred Spiegelenburg reopened YARN-4227:
-----------------------------------------

[~Steven Rand] looks like you are correct, lets re-open this issue.
The {{containerCompleted}} call we do just before we release the container on 
the node is synchronised which means that we could have been waiting there for 
just long enough to cause the node to become {{null}}
The patch will need a rebase due to all the pre-emption changes that have gone 
in and I'll upload a new patch soon.

> FairScheduler: RM quits processing expired container from a removed node
> ------------------------------------------------------------------------
>
>                 Key: YARN-4227
>                 URL: https://issues.apache.org/jira/browse/YARN-4227
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.3.0, 2.5.0, 2.7.1
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Critical
>         Attachments: YARN-4227.2.patch, YARN-4227.3.patch, YARN-4227.4.patch, 
> YARN-4227.patch
>
>
> Under some circumstances the node is removed before an expired container 
> event is processed causing the RM to exit:
> {code}
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: 
> Expired:container_1436927988321_1307950_01_000012 Timed out after 600 secs
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: 
> container_1436927988321_1307950_01_000012 Container Transitioned from 
> ACQUIRED to EXPIRED
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: 
> Completed container: container_1436927988321_1307950_01_000012 in state: 
> EXPIRED event:EXPIRE
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op   
>    OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  
> APPID=application_1436927988321_1307950 
> CONTAINERID=container_1436927988321_1307950_01_000012
> 2015-10-04 21:14:01,063 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in 
> handling event type CONTAINER_EXPIRED to the scheduler
> java.lang.NullPointerException
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585)
>       at java.lang.Thread.run(Thread.java:745)
> 2015-10-04 21:14:01,063 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 
> and 2.6.0 by different customers.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Reopened] (YARN-4227) FairScheduler: RM quits processing expired container from a removed node

Reply via email to