[ https://issues.apache.org/jira/browse/YARN-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299461#comment-16299461 ]
Wilfred Spiegelenburg commented on YARN-4227: --------------------------------------------- Tests pass locally Test failures also can not be related because they are hard coded to only test the capacity scheduler. > FairScheduler: RM quits processing expired container from a removed node > ------------------------------------------------------------------------ > > Key: YARN-4227 > URL: https://issues.apache.org/jira/browse/YARN-4227 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler > Affects Versions: 2.3.0, 2.5.0, 2.7.1 > Reporter: Wilfred Spiegelenburg > Assignee: Wilfred Spiegelenburg > Priority: Critical > Attachments: YARN-4227.2.patch, YARN-4227.3.patch, YARN-4227.4.patch, > YARN-4227.5.patch, YARN-4227.patch > > > Under some circumstances the node is removed before an expired container > event is processed causing the RM to exit: > {code} > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.util.AbstractLivelinessMonitor: > Expired:container_1436927988321_1307950_01_000012 Timed out after 600 secs > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_1436927988321_1307950_01_000012 Container Transitioned from > ACQUIRED to EXPIRED > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp: > Completed container: container_1436927988321_1307950_01_000012 in state: > EXPIRED event:EXPIRE > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=system_op > OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS > APPID=application_1436927988321_1307950 > CONTAINERID=container_1436927988321_1307950_01_000012 > 2015-10-04 21:14:01,063 FATAL > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in > handling event type CONTAINER_EXPIRED to the scheduler > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.completedContainer(FairScheduler.java:849) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1273) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:122) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:585) > at java.lang.Thread.run(Thread.java:745) > 2015-10-04 21:14:01,063 INFO > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. > {code} > The stack trace is from 2.3.0 but the same issue has been observed in 2.5.0 > and 2.6.0 by different customers. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org