[ 
https://issues.apache.org/jira/browse/YARN-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472837#comment-16472837
 ] 

Eric Yang commented on YARN-8265:
---------------------------------

[~billie.rinaldi] I am struggling to understand why the node manager would 
decide to restart the docker container without consulting the application 
master.  The AM decides the state of the containers, and the node manager only 
follows orders from the AM.  This helps prevent race conditions between the AM 
and NM over which container should stay up and running.  The AM follows state 
transitions to ensure it stays on a pre-defined path.  With container relaunch 
implemented in YARN-7973, the AM still decides when to restart a container.  
The "onContainerRestart" event will be received by the AM.  If we run 
ContainerStartedTransition again, it will check for IP changes and cancel the 
scheduled timer thread.  I think this will lead to a more desirable outcome 
without leaving the timer thread open-ended.
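
To illustrate the idea of re-running the "started" logic on restart, here is a 
minimal sketch in Java.  It is not the Service AM code: ContainerIpTracker and 
all of its methods are hypothetical names, and a boolean flag stands in for the 
scheduled status retriever so the flow stays deterministic.  The only behaviour 
it models is that the retriever is cancelled once an IP is seen and re-armed 
when a restart event arrives.

```java
/**
 * Hypothetical model of the flow discussed above; none of these names come
 * from the YARN code base.
 */
public class ContainerIpTracker {
    // Last IP reported for the container, or null if none has been seen yet.
    private String ip;
    // Stands in for the scheduled status retriever: true while it is armed.
    private boolean polling;

    /** Analogous to the "started" transition: arm the status retriever. */
    public void onContainerStarted() {
        polling = true;
    }

    /** Analogous to the restart callback: run the same logic as a fresh start. */
    public void onContainerRestart() {
        onContainerStarted();
    }

    /**
     * One poll result. A null or empty value means no IP has been assigned yet,
     * so the retriever stays armed. Results are ignored while it is cancelled.
     */
    public void onStatus(String reportedIp) {
        if (!polling || reportedIp == null || reportedIp.isEmpty()) {
            return;
        }
        ip = reportedIp;   // picks up a changed IP after a restart
        polling = false;   // cancel the retriever so it is not left open-ended
    }

    public String getIp() {
        return ip;
    }

    public boolean isPolling() {
        return polling;
    }
}
```

In this sketch the previous IP stays visible until a new one is reported after 
the restart; whether the real AM should clear it earlier is a separate design 
question that the sketch does not try to answer.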

An alternate approach is to move ContainerStatusRetriever to 
ContainerBecomeReadyTransition, and use the BECOME_READY transition to check 
for the IP address.

> Service AM should retrieve new IP for docker container relaunched by NM
> -----------------------------------------------------------------------
>
>                 Key: YARN-8265
>                 URL: https://issues.apache.org/jira/browse/YARN-8265
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn-native-services
>    Affects Versions: 3.1.0
>            Reporter: Eric Yang
>            Assignee: Billie Rinaldi
>            Priority: Critical
>         Attachments: YARN-8265.001.patch, YARN-8265.002.patch, 
> YARN-8265.003.patch
>
>
> When a docker container is restarted, it gets a new IP, but the service AM 
> only retrieves one IP for a container and then cancels the container status 
> retriever. I suspect the issue would be solved by restarting the retriever 
> (if it has been canceled) when the onContainerRestart callback is received, 
> but we'll have to do some testing to make sure this works.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
