[ https://issues.apache.org/jira/browse/MAPREDUCE-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hitesh Shah reassigned MAPREDUCE-2693:
--------------------------------------

    Assignee: Hitesh Shah  (was: Sharad Agarwal)

> NPE in AM causes it to lose containers which are never returned back to RM
> --------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2693
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2693
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>            Reporter: Amol Kekre
>            Assignee: Hitesh Shah
>            Priority: Critical
>             Fix For: 0.23.0
>
>
> The following exception in AM of an application at the top of queue causes
> this. Once this happens, AM keeps obtaining containers from RM and simply
> loses them. Eventually on a cluster with multiple jobs, no more scheduling
> happens because of these lost containers.
>
> It happens when there are blacklisted nodes at the app level in AM. A bug in
> AM (RMContainerRequestor.containerFailedOnHost(hostName)) is causing this -
> nodes are simply getting removed from the request-table. We should make sure
> RM also knows about this update.
>
> ========================================================================
>
> 11/06/17 06:11:18 INFO rm.RMContainerAllocator: Assigned based on host match 98.138.163.34
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=4978 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=4977 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=1540 #asks=5
> 11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20 resourceName=... numContainers=1539 #asks=6
> 11/06/17 06:11:18 ERROR rm.RMContainerAllocator: ERROR IN CONTACTING RM.
> java.lang.NullPointerException
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decResourceRequest(RMContainerRequestor.java:246)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decContainerReq(RMContainerRequestor.java:198)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:523)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$200(RMContainerAllocator.java:433)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:151)
>         at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:220)
>         at java.lang.Thread.run(Thread.java:619)
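
For illustration only: the sketch below shows one way that removing a host from a request table, as the description says containerFailedOnHost(hostName) does, could lead to a NullPointerException on a later decrement. It is not the Hadoop code. The class RequestTableSketch, the nested-map layout (priority -> resource name -> count), and the methods addRequest, decUnguarded, decGuarded, and outstanding are all hypothetical names assumed for this example. It also covers only the local null check; it does not model telling the RM about the removal, which the report says is needed too.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for an AM-side request table.
// This is NOT org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.
final class RequestTableSketch {
    // Assumed layout: priority -> resource name (host, rack, ...) -> outstanding containers.
    private final Map<Integer, Map<String, Integer>> table = new HashMap<>();

    void addRequest(int priority, String resourceName, int numContainers) {
        table.computeIfAbsent(priority, p -> new HashMap<>())
             .merge(resourceName, numContainers, Integer::sum);
    }

    // Mimics the behaviour described in the report: the failed host is simply
    // dropped from the table, with no other bookkeeping.
    void containerFailedOnHost(String hostName) {
        for (Map<String, Integer> byName : table.values()) {
            byName.remove(hostName);
        }
    }

    // Unguarded decrement: throws NullPointerException if the priority or the
    // resource entry is missing (unboxing a null Integer).
    int decUnguarded(int priority, String resourceName) {
        Map<String, Integer> byName = table.get(priority);
        int remaining = byName.get(resourceName) - 1;
        byName.put(resourceName, remaining);
        return remaining;
    }

    // Guarded decrement: returns false instead of throwing when there is
    // nothing left to decrement for this priority and resource name.
    boolean decGuarded(int priority, String resourceName) {
        Map<String, Integer> byName = table.get(priority);
        if (byName == null) {
            return false;
        }
        Integer current = byName.get(resourceName);
        if (current == null) {
            return false;
        }
        if (current <= 1) {
            byName.remove(resourceName);   // last outstanding container for this name
        } else {
            byName.put(resourceName, current - 1);
        }
        return true;
    }

    // Outstanding count for inspection; 0 when no entry exists.
    int outstanding(int priority, String resourceName) {
        Map<String, Integer> byName = table.get(priority);
        if (byName == null) {
            return 0;
        }
        return byName.getOrDefault(resourceName, 0);
    }
}
```

In this sketch, calling containerFailedOnHost for a host and then decUnguarded for the same host throws, while decGuarded returns false and leaves the table unchanged. Whether a guard of this kind matches the eventual fix for this issue is not stated in the report.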