[ 
https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092937#comment-15092937
 ] 

Wangda Tan commented on YARN-4502:
----------------------------------

Hi [~vinodkv],

My understanding is that this could happen when:
- CS calls doneApplicationAttempt
- Which causes containers to be killed while they're still in the ALLOCATED state
- A ContainerRescheduledEvent is added to the event queue:
{code}
  private static final class ContainerRescheduledTransition extends
      FinishedTransition {

    @Override
    public void transition(RMContainerImpl container, RMContainerEvent event) {
      // Tell scheduler to recover request of this container to app
      container.eventHandler.handle(new ContainerRescheduledEvent(container));
      super.transition(container, event);
    }
  }
{code}
- If the add-application-attempt event is sent to the scheduler before the 
container-rescheduled event arrives, the application attempt will already have 
been replaced, so the resource request will be restored to the next attempt.
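
To make the ordering concrete, here is a toy sketch of the sequence above. It is 
not RM code: RescheduleRaceSketch, Event, EventType and pendingOnSecondAttempt 
are made-up names, and it assumes the scheduler restores a recovered request to 
whichever attempt is current when the event is handled.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the suspected event ordering. None of these types are real YARN classes. */
public class RescheduleRaceSketch {

    enum EventType { ATTEMPT_ADDED, CONTAINER_RESCHEDULED }

    /** Hypothetical event carrying the attempt id it was raised for. */
    static final class Event {
        final EventType type;
        final int attemptId;

        Event(EventType type, int attemptId) {
            this.type = type;
            this.attemptId = attemptId;
        }
    }

    /** Replays events in queue order and returns the pending requests held by attempt 2. */
    static int pendingOnSecondAttempt(List<Event> queueOrder) {
        int currentAttempt = 1;
        int[] pending = new int[3]; // index = attempt id, index 0 unused
        for (Event e : queueOrder) {
            if (e.type == EventType.ATTEMPT_ADDED) {
                // Assumption: the new attempt replaces the old one and asks for one AM container.
                currentAttempt = e.attemptId;
                pending[currentAttempt] = 1;
            } else {
                // Assumption: the recovered request goes to the current attempt,
                // not to the attempt the killed container belonged to.
                pending[currentAttempt] += 1;
            }
        }
        return pending[2];
    }

    /** Rescheduled event is handled while attempt 1 is still the current attempt. */
    static List<Event> inOrder() {
        return Arrays.asList(
            new Event(EventType.CONTAINER_RESCHEDULED, 1),
            new Event(EventType.ATTEMPT_ADDED, 2));
    }

    /** Attempt 2 is added before the rescheduled event from attempt 1 is handled. */
    static List<Event> raced() {
        return Arrays.asList(
            new Event(EventType.ATTEMPT_ADDED, 2),
            new Event(EventType.CONTAINER_RESCHEDULED, 1));
    }

    public static void main(String[] args) {
        System.out.println("in order: " + pendingOnSecondAttempt(inOrder()));
        System.out.println("raced:    " + pendingOnSecondAttempt(raced()));
    }
}
```

In this model the raced ordering leaves the second attempt holding two pending 
requests instead of one, which would match the two containers seen for 
appattempt_..._000002 in the description.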

Thoughts?

> Sometimes Two AM containers get launched
> ----------------------------------------
>
>                 Key: YARN-4502
>                 URL: https://issues.apache.org/jira/browse/YARN-4502
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Yesha Vora
>            Assignee: Wangda Tan
>            Priority: Critical
>
> Scenario : 
> * set yarn.resourcemanager.am.max-attempts = 2
> * start dshell application
> {code}
> yarn org.apache.hadoop.yarn.applications.distributedshell.Client \
>   -jar hadoop-yarn-applications-distributedshell-*.jar \
>   -attempt_failures_validity_interval 60000 -shell_command "sleep 150" \
>   -num_containers 16
> {code}
> * Kill AM pid
> * Print container list for 2nd attempt
> {code}
> yarn container -list appattempt_1450825622869_0001_000002
> INFO impl.TimelineClientImpl: Timeline service address: http://xxx:port/ws/v1/timeline/
> INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10:<port>
> Total number of containers :2
> Container-Id                                Start Time                      Finish Time  State    Host       Node Http Address  LOG-URL
> container_e12_1450825622869_0001_02_000002  Tue Dec 22 23:07:35 +0000 2015  N/A          RUNNING  xxx:25454  http://xxx:8042    http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_000002/hrt_qa
> container_e12_1450825622869_0001_02_000001  Tue Dec 22 23:07:34 +0000 2015  N/A          RUNNING  xxx:25454  http://xxx:8042    http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_000001/hrt_qa
> {code}
> * Look for new AM pid 
> Here, the 2nd AM container was supposed to be started on 
> container_e12_1450825622869_0001_02_000001. But the AM was not launched on 
> container_e12_1450825622869_0001_02_000001. It was in ACQUIRED state. 
> On the other hand, container_e12_1450825622869_0001_02_000002 got the AM running. 
> Expected behavior: RM should not start 2 containers for starting the AM



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
