[ https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092612#comment-15092612 ]
Vinod Kumar Vavilapalli commented on YARN-4502:
-----------------------------------------------

[~leftnoteasy], I started writing a test for this assuming the previous hypothesis, and it doesn't add up.

bq. After YARN-3535, all containers transition from ALLOCATED to KILLED state will be re-added to scheduler. And such resource request will be added to current scheduler application attempt.

Two cases here:
 # If the container (in allocated state) got killed before the AM crash, it will get added to app-attempt #1, so this bug won't happen.
 # An allocated container simply doesn't survive an AM crash (both when keepContainerAcrossApplicationAttempt is on and off) - the scheduler itself kills all allocated containers right after the AM crashes, as part of {{doneApplicationAttempt()}}. These killed containers also get added to app-attempt #1, because current-app-attempt is not switched till {{addApplicationAttempt()}} comes in for the new app-attempt.

So, it doesn't look like our previous analysis is right.

/cc [~jianhe]

[~yeshavora], do you have the RM logs?
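
To make the two cases concrete, here is a toy model of the ordering argument. It is only a sketch and not the scheduler code: the classes {{ToyScheduler}} and {{Attempt}} and all method bodies are made up for illustration, and only the method names {{doneApplicationAttempt()}} and {{addApplicationAttempt()}} are taken from the comment above. The assumption it encodes is the one stated there: a killed ALLOCATED container has its request re-added to whichever attempt is current, and the current attempt only changes when {{addApplicationAttempt()}} is called.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical classes, not the real YARN scheduler.
public class AttemptSwitchSketch {

    static class Attempt {
        final int id;
        // Requests re-added because an ALLOCATED container was killed.
        final List<String> reAddedRequests = new ArrayList<>();
        Attempt(int id) { this.id = id; }
    }

    static class ToyScheduler {
        Attempt current;
        final List<String> allocated = new ArrayList<>();

        // The only place where the current attempt is switched.
        Attempt addApplicationAttempt(int id) {
            current = new Attempt(id);
            return current;
        }

        void allocate(String containerId) {
            allocated.add(containerId);
        }

        // ALLOCATED -> KILLED: re-add the request to the *current* attempt.
        void killAllocated(String containerId) {
            if (allocated.remove(containerId)) {
                current.reAddedRequests.add(containerId);
            }
        }

        // AM crash: kill everything still allocated; 'current' is untouched.
        void doneApplicationAttempt() {
            for (String c : new ArrayList<>(allocated)) {
                killAllocated(c);
            }
        }
    }

    // Returns {requests re-added to attempt 1, requests re-added to attempt 2}.
    static int[] simulate() {
        ToyScheduler s = new ToyScheduler();
        Attempt first = s.addApplicationAttempt(1);
        s.allocate("c1");
        s.allocate("c2");
        s.killAllocated("c1");        // case 1: killed before the AM crash
        s.doneApplicationAttempt();   // case 2: AM crash kills the rest
        Attempt second = s.addApplicationAttempt(2);
        return new int[] { first.reAddedRequests.size(), second.reAddedRequests.size() };
    }

    public static void main(String[] args) {
        int[] r = simulate();
        System.out.println("attempt 1 re-added: " + r[0] + ", attempt 2 re-added: " + r[1]);
    }
}
```

Under these assumptions both kills land on attempt 1 and attempt 2 starts with nothing re-added, which is why the earlier hypothesis does not explain the second AM container.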

> Sometimes Two AM containers get launched
> ----------------------------------------
>
>                 Key: YARN-4502
>                 URL: https://issues.apache.org/jira/browse/YARN-4502
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Yesha Vora
>            Assignee: Wangda Tan
>            Priority: Critical
>
> Scenario :
> * set yarn.resourcemanager.am.max-attempts = 2
> * start dshell application
> {code}
> yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar hadoop-yarn-applications-distributedshell-*.jar -attempt_failures_validity_interval 60000 -shell_command "sleep 150" -num_containers 16
> {code}
> * Kill AM pid
> * Print container list for 2nd attempt
> {code}
> yarn container -list appattempt_1450825622869_0001_000002
> INFO impl.TimelineClientImpl: Timeline service address: http://xxx:port/ws/v1/timeline/
> INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10:<port>
> Total number of containers :2
> Container-Id    Start Time    Finish Time    State    Host    Node Http Address    LOG-URL
> container_e12_1450825622869_0001_02_000002    Tue Dec 22 23:07:35 +0000 2015    N/A    RUNNING    xxx:25454    http://xxx:8042    http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_000002/hrt_qa
> container_e12_1450825622869_0001_02_000001    Tue Dec 22 23:07:34 +0000 2015    N/A    RUNNING    xxx:25454    http://xxx:8042    http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_000001/hrt_qa
> {code}
> * look for new AM pid
> Here, the 2nd AM container was supposed to be started on container_e12_1450825622869_0001_02_000001. But the AM was not launched on container_e12_1450825622869_0001_02_000001. It was in ACQUIRED state.
> On the other hand, container_e12_1450825622869_0001_02_000002 got the AM running.
> Expected behavior: RM should not start 2 containers for starting AM

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)