[ https://issues.apache.org/jira/browse/YARN-7278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418944#comment-16418944 ]
Shane Kumpf commented on YARN-7278:
-----------------------------------

I believe this is now resolved by the changes added in YARN-5366. We no longer call {{docker rm}} before writing the exit code, and we no longer depend on {{docker wait}}. Closing this for now, but please reopen if you still see this after applying that patch.

> LinuxContainer in docker mode will be failed when nodemanager restart,
> because timeout for docker is too slow.
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-7278
>                 URL: https://issues.apache.org/jira/browse/YARN-7278
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.0
>         Environment: CentOS
>            Reporter: zhengchenyu
>            Priority: Major
>             Fix For: 2.9.1
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> In our cluster, NodeManager recovery is turned on, and we use LinuxContainer
> in docker mode.
> A container may fail when the NodeManager restarts. The exception is below:
> {code}
> [2017-09-29T15:47:14.433+08:00] [INFO] containermanager.monitor.ContainersMonitorImpl.run(ContainersMonitorImpl.java 472) [Container Monitor] : Memory usage of ProcessTree 120523 for container-id container_1506600355508_0023_01_000004: -1B of 10 GB physical memory used; -1B of 31 GB virtual memory used
> [2017-09-29T15:47:15.219+08:00] [ERROR] containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java 93) [ContainersLauncher #1] : Unable to recover container container_1506600355508_0023_01_000004
> java.io.IOException: Timeout while waiting for exit code from container_1506600355508_0023_01_000004
> [2017-09-29T15:47:15.220+08:00] [INFO] containermanager.container.ContainerImpl.handle(ContainerImpl.java 1142) [AsyncDispatcher event handler] : Container container_1506600355508_0023_01_000004 transitioned from RUNNING to EXITED_WITH_FAILURE
> [2017-09-29T15:47:15.221+08:00] [INFO] containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java 440) [AsyncDispatcher event handler] : Cleaning up container container_1506600355508_0023_01_000004
> {code}
> I guess the process has finished, but 2 seconds later (the timeout is held in
> the variable msecLeft) the *.pid.exitcode file still has not been created.
> I then changed the variable to 20000 ms, and the container recovered
> successfully when the NodeManager restarted.
> So I think the timeout is too short for the docker container to complete
> its work.
> In docker mode of LinuxContainer, the NM monitors the real task, which is
> launched by the "docker run" command. Then the "docker wait" command waits
> for the exit code, and "docker rm" deletes the docker container. Lastly,
> container-executor writes the exit code. So if any of these docker commands
> is slow enough, the NM can no longer monitor the container. In fact,
> docker rm is always slow.
> I think the exit code of docker rm has nothing to do with the real task, so
> we could write "*.pid.exitcode" before running docker rm. Alternatively, we
> could monitor the docker wait process instead of the real task.
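
For readers unfamiliar with the recovery path, here is a minimal sketch of the kind of wait loop the description refers to: poll for the exit-code file and give up once the time budget runs out. It is an illustration under stated assumptions, not the NodeManager's actual code. {{ExitCodeWaiter}}, {{waitForExitCode}}, and its parameters are hypothetical names; only {{msecLeft}}, the 2000 ms and 20000 ms values, and the error message text come from the report.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only; not part of Hadoop.
class ExitCodeWaiter {

    /**
     * Polls until exitCodeFile exists or timeoutMs has been used up,
     * then parses the file contents as the container's exit code.
     */
    static int waitForExitCode(Path exitCodeFile, long timeoutMs, long pollMs)
            throws IOException, InterruptedException {
        long msecLeft = timeoutMs; // the report used 2000, then 20000
        while (!Files.exists(exitCodeFile) && msecLeft > 0) {
            Thread.sleep(pollMs);
            msecLeft -= pollMs;
        }
        if (!Files.exists(exitCodeFile)) {
            // Same failure mode as in the log: the file never showed up in time.
            throw new IOException(
                "Timeout while waiting for exit code from " + exitCodeFile);
        }
        byte[] raw = Files.readAllBytes(exitCodeFile);
        return Integer.parseInt(new String(raw, StandardCharsets.UTF_8).trim());
    }

    public static void main(String[] args) throws Exception {
        // Simulate a container-executor that has already written the exit code.
        Path file = Files.createTempFile("container", ".pid.exitcode");
        Files.write(file, "0\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(waitForExitCode(file, 2000, 100));
        Files.delete(file);
    }
}
```

With a loop of this shape, any step that runs before the exit code is written (such as a slow {{docker rm}}) eats into the time budget, which is why writing the exit code earlier, or raising the timeout, avoids the failure.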