[ https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070289#comment-15070289 ]
Wangda Tan commented on YARN-4502:
----------------------------------

Thanks to [~yeshavora] for reporting this issue. I looked at it with [~jianhe]/[~vinodkv]; the root cause of this problem is:
- After YARN-3535, the resource requests of all containers that transition from the ALLOCATED to the KILLED state are re-added to the scheduler, and they are added to the *current* scheduler application attempt.
- If the AM crashes while some containers are still in the ALLOCATED state, the resource requests of those containers can therefore be added to the *new* scheduler application attempt.
- When the new application attempt requests its AM container, it calls:
{code}
// AM resource has been checked when submission
Allocation amContainerAllocation =
    appAttempt.scheduler.allocate(appAttempt.applicationAttemptId,
        Collections.singletonList(appAttempt.amReq),
        EMPTY_CONTAINER_RELEASE_LIST, null, null);
if (amContainerAllocation != null
    && amContainerAllocation.getContainers() != null) {
  assert (amContainerAllocation.getContainers().size() == 0);
}
{code}
Because of the stale requests, extra containers can be allocated by this scheduler.allocate call. These containers are silently ignored, because the *assert* is not enabled in production environments.
- As a result, containers can be leaked while allocating retried AM containers.

*Possible fixes*:
1) Release all allocated containers in {{amContainerAllocation.getContainers()}}, OR
2) In {{AbstractYarnScheduler#recoverResourceRequestForContainer}}, instead of using {{getCurrentAttemptForContainer}}, recover the ResourceRequest only to the attempt that actually includes the container.
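To illustrate why the assert catches nothing here: Java {{assert}} statements are no-ops unless the JVM is started with {{-ea}}, which production RMs typically are not. The following is a minimal, self-contained sketch of the defensive pattern behind fix option 1 (class and method names are hypothetical stand-ins, not YARN's actual API): explicitly release any unexpected allocations instead of asserting the list is empty.
{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of fix option 1: release unexpected AM-attempt
// allocations explicitly rather than relying on an assert, which is
// silently skipped unless the JVM runs with -ea.
public class AmAllocationCheck {

    // Stand-in for the scheduler's release path; records released ids.
    static List<String> releasedContainers = new ArrayList<>();

    static void handleAmAllocation(List<String> allocatedContainerIds) {
        // Old behavior: assert allocatedContainerIds.isEmpty();
        // A no-op in production, so leaked containers went unnoticed.
        for (String containerId : allocatedContainerIds) {
            releasedContainers.add(containerId); // defensive release instead
        }
        allocatedContainerIds.clear();
    }

    public static void main(String[] args) {
        // Simulate a stale resource request producing an unexpected container.
        List<String> allocated = new ArrayList<>();
        allocated.add("container_e12_1450825622869_0001_02_000001");
        handleAmAllocation(allocated);
        System.out.println(releasedContainers.size()); // prints 1
        System.out.println(allocated.isEmpty());       // prints true
    }
}
{code}
The point of the pattern is that the cleanup happens unconditionally, so correctness no longer depends on a JVM flag.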
> Sometimes Two AM containers get launched
> ----------------------------------------
>
> Key: YARN-4502
> URL: https://issues.apache.org/jira/browse/YARN-4502
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Yesha Vora
> Assignee: Wangda Tan
> Priority: Critical
>
> Scenario:
> * set yarn.resourcemanager.am.max-attempts = 2
> * start dshell application
> {code}
> yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar hadoop-yarn-applications-distributedshell-*.jar -attempt_failures_validity_interval 60000 -shell_command "sleep 150" -num_containers 16
> {code}
> * Kill AM pid
> * Print container list for 2nd attempt
> {code}
> yarn container -list appattempt_1450825622869_0001_000002
> INFO impl.TimelineClientImpl: Timeline service address: http://xxx:port/ws/v1/timeline/
> INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10:<port>
> Total number of containers :2
> Container-Id                                 Start Time                       Finish Time  State    Host       Node Http Address  LOG-URL
> container_e12_1450825622869_0001_02_000002   Tue Dec 22 23:07:35 +0000 2015   N/A          RUNNING  xxx:25454  http://xxx:8042    http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_000002/hrt_qa
> container_e12_1450825622869_0001_02_000001   Tue Dec 22 23:07:34 +0000 2015   N/A          RUNNING  xxx:25454  http://xxx:8042    http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_000001/hrt_qa
> {code}
> * look for new AM pid
>
> Here, the 2nd AM container was supposed to be started in container_e12_1450825622869_0001_02_000001. But the AM was not launched there; that container stayed in the ACQUIRED state. Instead, container_e12_1450825622869_0001_02_000002 got the AM running.
> Expected behavior: RM should not start 2 containers for starting the AM.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)