[ https://issues.apache.org/jira/browse/YARN-8160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570774#comment-16570774 ]
Chandni Singh edited comment on YARN-8160 at 8/6/18 10:06 PM:
--------------------------------------------------------------
Attached are the logs of container 3, which fails to re-initialize. When the container is re-initialized, it is stopped and cleaned up. This causes the container to exit, but here it exits with code {{255}} instead of {{FORCE_KILLED}} or {{TERMINATED}}. Because the container exits with a failure code ({{255}}), its state in the NM changes from {{REINITIALIZING_AWAITING_KILL}} to {{EXITED_WITH_FAILURE}}.

Below are the relevant log statements:

1. Reinitialization of the container is triggered:
{code:java}
ctr005.log:2018-08-02 22:30:41,100 DEBUG container.ContainerImpl (ContainerImpl.java:handle(2080)) - Processing container_e02_1533231998644_0009_01_000003 of type REINITIALIZE_CONTAINER
ctr005.log:2018-08-02 22:30:41,101 INFO container.ContainerImpl (ContainerImpl.java:handle(2093)) - Container container_e02_1533231998644_0009_01_000003 transitioned from RUNNING to REINITIALIZING_AWAITING_KIL
{code}
2. Reinitialization triggers cleanup of the container:
{code:java}
ctr005.log:2018-08-02 22:30:41,102 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(734)) - Cleaning up container container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:41,102 DEBUG recovery.NMLeveldbStateStoreService (NMLeveldbStateStoreService.java:storeContainerKilled(555)) - storeContainerKilled: containerId=container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:41,102 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(752)) - Marking container container_e02_1533231998644_0009_01_000003 as inactive
ctr005.log:2018-08-02 22:30:41,102 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(759)) - Getting pid for container container_e02_1533231998644_0009_01_000003 to kill from pid file /tmp/hadoop/yarn/local/nmPrivate/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/container_e02_1533231998644_0009_01_000003.pid
ctr005.log:2018-08-02 22:30:41,102 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:getContainerPid(1084)) - Accessing pid for container container_e02_1533231998644_0009_01_000003 from pid file /tmp/hadoop/yarn/local/nmPrivate/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/container_e02_1533231998644_0009_01_000003.pid
ctr005.log:2018-08-02 22:30:41,102 DEBUG util.ProcessIdFileReader (ProcessIdFileReader.java:getProcessId(53)) - Accessing pid from pid file /tmp/hadoop/yarn/local/nmPrivate/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/container_e02_1533231998644_0009_01_000003.pid
ctr005.log:2018-08-02 22:30:41,102 DEBUG util.ProcessIdFileReader (ProcessIdFileReader.java:getProcessId(103)) - Got pid 364708 from path /tmp/hadoop/yarn/local/nmPrivate/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/container_e02_1533231998644_0009_01_000003.pid
ctr005.log:2018-08-02 22:30:41,102 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:getContainerPid(1096)) - Got pid 364708 for container container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:41,102 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:signalProcess(919)) - Sending signal to pid 364708 as user root for container container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:41,102 DEBUG docker.DockerCommandExecutor (DockerCommandExecutor.java:executeDockerCommand(89)) - Running docker command: inspect docker-command=inspect format=\{{.State.Status}} name=container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:41,103 DEBUG privileged.PrivilegedOperationExecutor (PrivilegedOperationExecutor.java:getPrivilegedOperationExecutionCommand(119)) - Privileged Execution Command Array: [/hadoop_dist/hadoop-yarn/bin/container-executor, --inspect-docker-container, --format=\{{.State.Status}}, container_e02_1533231998644_0009_01_000003]
ctr005.log:2018-08-02 22:30:41,129 DEBUG privileged.PrivilegedOperationExecutor (PrivilegedOperationExecutor.java:executePrivilegedOperation(155)) - [/hadoop_dist/hadoop-yarn/bin/container-executor, --inspect-docker-container, --format=\{{.State.Status}}, container_e02_1533231998644_0009_01_000003]
ctr005.log:2018-08-02 22:30:41,130 DEBUG docker.DockerCommandExecutor (DockerCommandExecutor.java:getContainerStatus(154)) - Container Status: running ContainerId: container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:41,131 DEBUG docker.DockerCommandExecutor (DockerCommandExecutor.java:executeDockerCommand(89)) - Running docker command: stop docker-command=stop name=container_e02_1533231998644_0009_01_000003
{code}
3. After 10 seconds, the stop command sent to the executor completes and the container is removed:
{code:java}
ctr005.log:2018-08-02 22:30:51,251 DEBUG privileged.PrivilegedOperationExecutor (PrivilegedOperationExecutor.java:executePrivilegedOperation(155)) - [/hadoop_dist/hadoop-yarn/bin/container-executor, --run-docker, /tmp/hadoop/yarn/local/nmPrivate/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/docker.container_e02_1533231998644_0009_01_0000038521705952835205058.cmd]
ctr005.log:2018-08-02 22:30:51,251 DEBUG privileged.PrivilegedOperationExecutor (PrivilegedOperationExecutor.java:executePrivilegedOperation(157)) - container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:51,251 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:signalProcess(927)) - Sent signal SIGTERM to pid 364708 as user root for container container_e02_1533231998644_0009_01_000003, result=success
ctr005.log:2018-08-02 22:30:51,298 DEBUG docker.DockerCommandExecutor (DockerCommandExecutor.java:executeDockerCommand(89)) - Running docker command: rm docker-command=rm name=container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:51,298 DEBUG privileged.PrivilegedOperationExecutor (PrivilegedOperationExecutor.java:getPrivilegedOperationExecutionCommand(119)) - Privileged Execution Command Array: [/hadoop_dist/hadoop-yarn/bin/container-executor, --remove-docker-container, container_e02_1533231998644_0009_01_000003]
ctr005.log:2018-08-02 22:30:51,977 DEBUG nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:postComplete(963)) - container_e02_1533231998644_0009_01_000003 post complete
ctr005.log:2018-08-02 22:30:51,977 DEBUG resources.CGroupsHandlerImpl (CGroupsHandlerImpl.java:deleteCGroup(535)) - deleteCGroup: /sys/fs/cgroup/cpu/hadoop-yarn-tmp-ctr-e138-1518143905142-423707-01-000002.localhost/container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:51,997 DEBUG launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainerFiles(1876)) - cleanup container /tmp/hadoop/yarn/local/usercache/root/appcache/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003 files
ctr005.log:2018-08-02 22:30:51,998 INFO nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(815)) - Deleting absolute path : /tmp/hadoop/yarn/local/usercache/root/appcache/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/launch_container.sh
ctr005.log:2018-08-02 22:30:51,998 DEBUG privileged.PrivilegedOperationExecutor (PrivilegedOperationExecutor.java:getPrivilegedOperationExecutionCommand(119)) - Privileged Execution Command Array: [/hadoop_dist/hadoop-yarn/bin/container-executor, nobody, root, 3, /tmp/hadoop/yarn/local/usercache/root/appcache/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/launch_container.sh]
ctr005.log:2018-08-02 22:30:52,006 DEBUG privileged.PrivilegedOperationExecutor (PrivilegedOperationExecutor.java:executePrivilegedOperation(155)) - [/hadoop_dist/hadoop-yarn/bin/container-executor, nobody, root, 3, /tmp/hadoop/yarn/local/usercache/root/appcache/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/launch_container.sh]
ctr005.log:2018-08-02 22:30:52,006 INFO nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:deleteAsUser(815)) - Deleting absolute path : /tmp/hadoop/yarn/local/usercache/root/appcache/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/container_tokens
ctr005.log:2018-08-02 22:30:52,006 DEBUG privileged.PrivilegedOperationExecutor (PrivilegedOperationExecutor.java:getPrivilegedOperationExecutionCommand(119)) - Privileged Execution Command Array: [/hadoop_dist/hadoop-yarn/bin/container-executor, nobody, root, 3, /tmp/hadoop/yarn/local/usercache/root/appcache/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/container_tokens]
{code}
4. Meanwhile, the container exits with exit code 255:
{code:java}
ctr005.log:2018-08-02 22:30:52,040 WARN nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:handleExitCode(585)) - Exit code from container container_e02_1533231998644_0009_01_000003 is : 255
ctr005.log:2018-08-02 22:30:52,040 WARN nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:handleExitCode(591)) - Exception from container-launch with container ID: container_e02_1533231998644_0009_01_000003 and exit code: 255
ctr005.log:2018-08-02 22:30:52,041 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Container id: container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:52,041 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Shell error output: Error: No such object: container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:52,041 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Could not inspect docker to get pid /usr/bin/docker inspect --format \{{.State.Pid}} container_e02_1533231998644_0009_01_000003.
ctr005.log:2018-08-02 22:30:52,041 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Error: No such object: container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:52,041 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Could not inspect docker to get pid /usr/bin/docker inspect --format \{{.State.Pid}} container_e02_1533231998644_0009_01_000003.
ctr005.log:2018-08-02 22:30:52,041 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Error: No such object: container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:52,041 INFO nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(541)) - Could not inspect docker to get exitcode: /usr/bin/docker inspect --format \{{.State.ExitCode}} container_e02_1533231998644_0009_01_000003.
{code}
The exit code 255 seems to be because the container and its files are cleaned up prematurely: by the time the launch process tries to read the container's pid and exit code, {{docker rm}} has already removed the container, so {{docker inspect}} fails with {{No such object}}.

I can think of 2 solutions:
1. In the NodeManager, if a container is in {{REINITIALIZING_AWAITING_KILL}} and gets a {{CONTAINER_EXITED_WITH_FAILURE}} event, handle it the same way as a {{CONTAINER_KILLED_ON_REQUEST}} event.
2. Defer the cleanup of the container files until the container has exited.

[~eyang] [~shaneku...@gmail.com] What do you think?

> Yarn Service Upgrade: Support upgrade of service that use docker containers
> ----------------------------------------------------------------------------
>
> Key: YARN-8160
> URL: https://issues.apache.org/jira/browse/YARN-8160
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Chandni Singh
> Assignee: Chandni Singh
> Priority: Major
> Labels: Docker
> Attachments: container_e02_1533231998644_0009_01_000003.nm.log
>
>
> Ability to upgrade dockerized yarn native services.
> Ref: YARN-5637
> *Background*
> Container upgrade is supported by the NM via the {{reInitializeContainer}} API.
> {{reInitializeContainer}} does *NOT* change the ContainerId of the upgraded container.
> The NM performs the following steps during {{reInitializeContainer}}:
> - kills the existing process
> - cleans up the container
> - launches another container with the new {{ContainerLaunchContext}}
> NOTE: {{ContainerLaunchContext}} holds all the information that is needed to upgrade the container.
> With {{reInitializeContainer}}, the following does *NOT* change:
> - container ID. This is not created by the NM; it is provided to it, and here the RM is not creating another container allocation.
> - {{localizedResources}}: this stays the same if the upgrade does *NOT* require additional resources, IIUC.
>
> The following does change with {{reInitializeContainer}}:
> - the working directory of the upgraded container changes. It is *NOT* a relaunch.
> *Changes required in the case of docker containers*
> - {{reInitializeContainer}} does not seem to work with Docker containers. Investigate and fix this.
> - [Future change] Add an additional API to the NM to pull the images, and modify {{reInitializeContainer}} to trigger the docker container launch without first pulling the image, possibly based on a flag.
> -- When the service upgrade is initialized, we can provide the user with an option to just pull the images on the NMs.
> -- When a component instance is upgraded, it calls {{reInitializeContainer}} with the pull-image flag set to false, since the NM will have already pulled the images.
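The first of the two solutions proposed in the comment can be illustrated with a deliberately small, hypothetical model of the NM container state machine. This is a sketch under stated assumptions, not {{ContainerImpl}}: only the state and event names taken from the comment ({{RUNNING}}, {{REINITIALIZING_AWAITING_KILL}}, {{EXITED_WITH_FAILURE}}, {{REINITIALIZE_CONTAINER}}, {{CONTAINER_KILLED_ON_REQUEST}}, {{CONTAINER_EXITED_WITH_FAILURE}}) come from the source. The {{ReinitSketch}} class, the {{RELAUNCHING}} state (a stand-in for "launch again with the new {{ContainerLaunchContext}}"), and the {{treatFailureAsKill}} switch are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * HYPOTHETICAL, heavily simplified model of the NM container state machine.
 * It is not ContainerImpl. RELAUNCHING and treatFailureAsKill are invented
 * here purely to contrast the observed behavior with proposed solution 1.
 */
public class ReinitSketch {

  enum State { RUNNING, REINITIALIZING_AWAITING_KILL, RELAUNCHING, EXITED_WITH_FAILURE }

  enum Event { REINITIALIZE_CONTAINER, CONTAINER_KILLED_ON_REQUEST, CONTAINER_EXITED_WITH_FAILURE }

  /**
   * Returns the next state. With treatFailureAsKill == false this mimics the
   * transition seen in the logs; with true, a failure exit observed while
   * awaiting the reinit kill is handled like a requested kill (solution 1).
   */
  static State next(State current, Event event, boolean treatFailureAsKill) {
    switch (current) {
      case RUNNING:
        if (event == Event.REINITIALIZE_CONTAINER) {
          return State.REINITIALIZING_AWAITING_KILL;
        }
        if (event == Event.CONTAINER_EXITED_WITH_FAILURE) {
          return State.EXITED_WITH_FAILURE;
        }
        return current;
      case REINITIALIZING_AWAITING_KILL:
        if (event == Event.CONTAINER_KILLED_ON_REQUEST) {
          // Expected path: the old process was killed, so relaunch it.
          return State.RELAUNCHING;
        }
        if (event == Event.CONTAINER_EXITED_WITH_FAILURE) {
          // e.g. exit code 255 after the docker container was already removed
          return treatFailureAsKill ? State.RELAUNCHING : State.EXITED_WITH_FAILURE;
        }
        return current;
      default:
        // Other states are out of scope for this sketch.
        return current;
    }
  }

  /** Replays events starting from RUNNING and records every state visited. */
  static List<State> replay(boolean treatFailureAsKill, Event... events) {
    List<State> states = new ArrayList<>();
    State state = State.RUNNING;
    states.add(state);
    for (Event event : events) {
      state = next(state, event, treatFailureAsKill);
      states.add(state);
    }
    return states;
  }

  public static void main(String[] args) {
    // The event sequence reported for container 3 in the attached logs.
    Event[] observed = {Event.REINITIALIZE_CONTAINER, Event.CONTAINER_EXITED_WITH_FAILURE};
    System.out.println("observed:   " + replay(false, observed));
    System.out.println("solution 1: " + replay(true, observed));
  }
}
```

Running {{main}} prints the observed path ending in {{EXITED_WITH_FAILURE}} next to the solution-1 path ending in {{RELAUNCHING}}. The flag exists only to contrast the two behaviors; the sketch does not settle whether treating every failure exit during reinitialization as a kill is safe (for example, a process that genuinely crashes just as the kill is issued), which is the trade-off against solution 2.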