[ https://issues.apache.org/jira/browse/YARN-7278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhengchenyu updated YARN-7278:
------------------------------
    Affects Version/s:     (was: 2.7.1)
                       2.8.0

> LinuxContainer in docker mode will be failed when nodemanager restart,
> because timeout for docker is too slow.
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-7278
>                 URL: https://issues.apache.org/jira/browse/YARN-7278
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.0
>         Environment: CentOS
>            Reporter: zhengchenyu
>             Fix For: 2.9.0
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> In our cluster, nodemanager recovery is turned on, and we use LinuxContainer
> in docker mode.
> A container may fail when the nodemanager restarts. The exception is below:
> {code}
> [2017-09-29T15:47:14.433+08:00] [INFO]
> containermanager.monitor.ContainersMonitorImpl.run(ContainersMonitorImpl.java
> 472) [Container Monitor] : Memory usage of ProcessTree 120523 for
> container-id container_1506600355508_0023_01_000004: -1B of 10 GB physical
> memory used; -1B of 31 GB virtual memory used
> [2017-09-29T15:47:15.219+08:00] [ERROR]
> containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java
> 93) [ContainersLauncher #1] : Unable to recover container
> container_1506600355508_0023_01_000004
> java.io.IOException: Timeout while waiting for exit code from
> container_1506600355508_0023_01_000004
> [2017-09-29T15:47:15.220+08:00] [INFO]
> containermanager.container.ContainerImpl.handle(ContainerImpl.java 1142)
> [AsyncDispatcher event handler] : Container
> container_1506600355508_0023_01_000004 transitioned from RUNNING to
> EXITED_WITH_FAILURE
> [2017-09-29T15:47:15.221+08:00] [INFO]
> containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java
> 440) [AsyncDispatcher event handler] : Cleaning up container
> container_1506600355508_0023_01_000004
> {code}
> I guess the process had finished, but 2 seconds later (the timeout variable
> is msecLeft) the *.pid.exitcode file still hadn't been created. After I
> changed the variable to 20000 ms, the container recovered successfully when
> the nodemanager restarted.
> So I think 2 seconds is too short for the docker container to finish its
> work.
> In docker mode of LinuxContainer, the nm monitors the real task, which is
> launched by the "docker run" command. Then the "docker wait" command waits
> for the exit code, and "docker rm" deletes the docker container. Lastly,
> container-executor writes the exit code. So if any of the docker commands is
> slow enough, the nm can no longer track the container: the real task has
> already exited, but the exit code file has not been written when the timeout
> expires. In fact, docker rm is always slow.
> I think the exit code of docker rm has nothing to do with the real task, so
> we could move the write of "*.pid.exitcode" to before the docker rm command.
> Alternatively, we could monitor the docker wait process instead of the real
> task.
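
For illustration only, here is a minimal Java sketch of the wait described above: poll for the *.pid.exitcode file until it has content or a timeout expires. It is not the NodeManager code. The class ExitCodeWaiter, both method names, and the poll interval are hypothetical; only the idea of a timeout (msecLeft, about 2 seconds, raised to 20000 ms in the report) comes from the issue text.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.OptionalInt;

// Hypothetical helper, not part of Hadoop: shows why a short timeout loses the
// race against a slow writer of the exit-code file.
public class ExitCodeWaiter {

  /** Stand-in for the step where container-executor writes the exit code. */
  public static void writeExitCode(Path exitCodeFile, int exitCode) {
    try {
      Files.write(exitCodeFile,
          Integer.toString(exitCode).getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /**
   * Polls until the file exists and is non-empty, or until timeoutMs elapses.
   * Returns an empty OptionalInt on timeout or interruption.
   * Simplification: assumes the writer emits the whole number in one write.
   */
  public static OptionalInt waitForExitCode(Path exitCodeFile, long timeoutMs,
      long pollMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      try {
        if (Files.exists(exitCodeFile)) {
          String text = new String(Files.readAllBytes(exitCodeFile),
              StandardCharsets.UTF_8).trim();
          if (!text.isEmpty()) {
            return OptionalInt.of(Integer.parseInt(text));
          }
        }
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      if (System.nanoTime() - deadline >= 0) {
        return OptionalInt.empty(); // timed out: file never appeared
      }
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return OptionalInt.empty();
      }
    }
  }

  public static void main(String[] args) {
    Path file = Paths.get(System.getProperty("java.io.tmpdir"),
        "demo.pid.exitcode");
    writeExitCode(file, 0);
    // 20000 ms mirrors the value that worked in the report.
    System.out.println(waitForExitCode(file, 20_000, 100));
    file.toFile().delete();
  }
}
```

With a 2000 ms timeout, any writer that takes longer than that (for example because it runs after a slow "docker rm") produces the empty result, which corresponds to the "Timeout while waiting for exit code" failure in the log. Writing the file before "docker rm", as proposed, would shorten the writer's delay rather than lengthen the reader's timeout.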