[ https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092937#comment-15092937 ]
Wangda Tan commented on YARN-4502:
----------------------------------

Hi [~vinodkv],

My understanding is that this could happen when:
- CS calls doneApplicationAttempt
- This kills containers that are still in the ALLOCATED state
- A ContainerRescheduledEvent is added to the event queue
{code}
  private static final class ContainerRescheduledTransition extends
      FinishedTransition {

    @Override
    public void transition(RMContainerImpl container, RMContainerEvent event) {
      // Tell scheduler to recover request of this container to app
      container.eventHandler.handle(new ContainerRescheduledEvent(container));
      super.transition(container, event);
    }
  }
{code}
- If the add-application-attempt event reaches the scheduler before the container-rescheduled event arrives, the application attempt will already have been replaced, so the resource request is restored to the next attempt.

Thoughts?

> Sometimes Two AM containers get launched
> ----------------------------------------
>
>                 Key: YARN-4502
>                 URL: https://issues.apache.org/jira/browse/YARN-4502
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Yesha Vora
>            Assignee: Wangda Tan
>            Priority: Critical
>
> Scenario :
> * set yarn.resourcemanager.am.max-attempts = 2
> * start dshell application
> {code}
> yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar
> hadoop-yarn-applications-distributedshell-*.jar
> -attempt_failures_validity_interval 60000 -shell_command "sleep 150"
> -num_containers 16
> {code}
> * Kill AM pid
> * Print container list for 2nd attempt
> {code}
> yarn container -list appattempt_1450825622869_0001_000002
> INFO impl.TimelineClientImpl: Timeline service address: http://xxx:port/ws/v1/timeline/
> INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10:<port>
> Total number of containers :2
> Container-Id    Start Time    Finish Time    State    Host    Node Http Address    LOG-URL
> container_e12_1450825622869_0001_02_000002    Tue Dec 22 23:07:35 +0000 2015    N/A    RUNNING    xxx:25454    http://xxx:8042    http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_000002/hrt_qa
> container_e12_1450825622869_0001_02_000001    Tue Dec 22 23:07:34 +0000 2015    N/A    RUNNING    xxx:25454    http://xxx:8042    http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_000001/hrt_qa
> {code}
> * look for new AM pid
> Here, the 2nd AM container was supposed to be started on
> container_e12_1450825622869_0001_02_000001. But the AM was not launched on
> container_e12_1450825622869_0001_02_000001. It was in ACQUIRED state.
> On the other hand, container_e12_1450825622869_0001_02_000002 got the AM running.
> Expected behavior: RM should not start 2 containers for starting AM

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
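
To make the event ordering suspected in the comment above concrete, here is a toy Java model of the hypothesis. It is only a sketch under stated assumptions: `RescheduleRaceSketch`, `EventType`, `Attempt`, and `pendingAmRequestsOnNewAttempt` are hypothetical names and not YARN classes. The model also assumes that adding an attempt immediately registers one AM request, and that a rescheduled container's request is recovered into whichever attempt is current when the event is handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

/** Toy model of the suspected ordering problem; not real YARN code. */
public class RescheduleRaceSketch {

    /** Hypothetical stand-ins for the two scheduler events in the comment. */
    enum EventType { ADD_ATTEMPT, CONTAINER_RESCHEDULED }

    /** Hypothetical stand-in for a scheduler-side application attempt. */
    static final class Attempt {
        final int id;
        final List<String> pendingAmRequests = new ArrayList<>();

        Attempt(int id) {
            this.id = id;
        }
    }

    /**
     * Handles the events one by one in the given order, as a single
     * dispatcher thread would, and returns how many AM requests are
     * pending on the attempt that is current at the end.
     */
    static int pendingAmRequestsOnNewAttempt(List<EventType> order) {
        Attempt current = new Attempt(1); // the attempt whose AM was killed
        Queue<EventType> queue = new ArrayDeque<>(order);
        while (!queue.isEmpty()) {
            switch (queue.poll()) {
                case ADD_ATTEMPT:
                    // Assumption: the new attempt replaces the old one and
                    // registers its own AM request right away.
                    current = new Attempt(current.id + 1);
                    current.pendingAmRequests.add("own AM request");
                    break;
                case CONTAINER_RESCHEDULED:
                    // Assumption: the request is recovered into whichever
                    // attempt is current at handling time.
                    current.pendingAmRequests.add("recovered AM request");
                    break;
            }
        }
        return current.pendingAmRequests.size();
    }

    public static void main(String[] args) {
        int benign = pendingAmRequestsOnNewAttempt(
            Arrays.asList(EventType.CONTAINER_RESCHEDULED, EventType.ADD_ATTEMPT));
        int racy = pendingAmRequestsOnNewAttempt(
            Arrays.asList(EventType.ADD_ATTEMPT, EventType.CONTAINER_RESCHEDULED));
        System.out.println("rescheduled first: " + benign + " AM request(s)");
        System.out.println("add-attempt first: " + racy + " AM request(s)");
    }
}
```

In this model the new attempt ends up with one AM request when the rescheduled event is handled first, and with two when the add-attempt event is handled first, which would match the two containers listed for the 2nd attempt in the report.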