[ https://issues.apache.org/jira/browse/YARN-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472837#comment-16472837 ]
Eric Yang commented on YARN-8265:
---------------------------------

[~billie.rinaldi] I am struggling to understand why the node manager would decide to restart the docker container without consulting the application master. The AM decides the state of the containers, and the node manager only follows orders from the AM. This helps prevent race conditions between the AM and the NM over which container should stay up and running. The AM follows state transitions to ensure it stays on a pre-defined path.

With container relaunch implemented in YARN-7973, the AM still decides when to restart a container, and the "onContainerRestart" event will be received by the AM. If we run ContainerStartedTransition again, it will check for IP changes and cancel the scheduled timer thread. I think this will lead to a more desirable outcome without leaving the timer thread open-ended.

An alternate approach is to move ContainerStatusRetriever to ContainerBecomeReadyTransition, and use the BECOME_READY transition to check for the IP address.

> Service AM should retrieve new IP for docker container relaunched by NM
> -----------------------------------------------------------------------
>
>                 Key: YARN-8265
>                 URL: https://issues.apache.org/jira/browse/YARN-8265
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn-native-services
>   Affects Versions: 3.1.0
>            Reporter: Eric Yang
>            Assignee: Billie Rinaldi
>            Priority: Critical
>         Attachments: YARN-8265.001.patch, YARN-8265.002.patch,
> YARN-8265.003.patch
>
>
> When a docker container is restarted, it gets a new IP, but the service AM
> only retrieves one IP for a container and then cancels the container status
> retriever. I suspect the issue would be solved by restarting the retriever
> (if it has been canceled) when the onContainerRestart callback is received,
> but we'll have to do some testing to make sure this works.
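
The behavior discussed here (poll for an IP, cancel the retriever once one is seen, and start it again when onContainerRestart arrives) can be modeled roughly as follows. This is only a sketch under stated assumptions: ContainerIpTracker, tick() and the Supplier standing in for the NM status query are hypothetical names and are not taken from the YARN service AM code or from any of the attached patches. The timer thread is reduced to an explicit tick() call so the state changes are easy to follow.

```java
import java.util.function.Supplier;

// Hypothetical model; these are not the actual YARN service AM classes.
final class ContainerIpTracker {
    // Stands in for a container status query against the NM (assumption).
    private final Supplier<String> statusSource;
    private String ip;        // last IP seen, null until the first retrieval
    private boolean polling;  // true while the retriever is scheduled

    ContainerIpTracker(Supplier<String> statusSource) {
        this.statusSource = statusSource;
    }

    // Models the container-started transition: schedule the retriever.
    void onContainerStarted() {
        polling = true;
    }

    // Models the restart callback: restart the retriever only if it
    // has already been cancelled.
    void onContainerRestart() {
        if (!polling) {
            polling = true;
        }
    }

    // One run of the retriever. Returns true if a new IP was recorded.
    boolean tick() {
        if (!polling) {
            return false; // retriever cancelled, nothing is queried
        }
        String current = statusSource.get();
        if (current == null || current.isEmpty()) {
            return false; // no IP reported yet, keep polling
        }
        boolean changed = !current.equals(ip);
        ip = current;
        polling = false; // cancel once an IP has been obtained
        return changed;
    }

    String ip() {
        return ip;
    }

    boolean isPolling() {
        return polling;
    }
}
```

In this model, a container that is relaunched with a new address keeps its stale IP until onContainerRestart() is called, which matches the symptom in the issue description. Because the retriever cancels itself again after the next successful read, it is not left open-ended.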

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org