[ 
https://issues.apache.org/jira/browse/YARN-7278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengchenyu updated YARN-7278:
------------------------------
    Affects Version/s:     (was: 2.7.1)
                       2.8.0

> LinuxContainer in docker mode will be failed when nodemanager restart, 
> because timeout for docker is too slow.
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-7278
>                 URL: https://issues.apache.org/jira/browse/YARN-7278
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.0
>         Environment: CentOS
>            Reporter: zhengchenyu
>             Fix For: 2.9.0
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> In our cluster, nodemanagere recovery is turn on, and we use LinuxConainer 
> with docker mode.
> Container may be failed when nodemanager restart, exception is below:
> {code}
> [2017-09-29T15:47:14.433+08:00] [INFO] 
> containermanager.monitor.ContainersMonitorImpl.run(ContainersMonitorImpl.java 
> 472) [Container Monitor] : Memory usage of ProcessTree 120523 for 
> container-id container_1506600355508_0023_01_000004: -1B of 10 GB physical 
> memory used; -1B of 31 GB virtual memory used
> [2017-09-29T15:47:15.219+08:00] [ERROR] 
> containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java
>  93) [ContainersLauncher #1] : Unable to recover container 
> container_1506600355508_0023_01_000004
> java.io.IOException: Timeout while waiting for exit code from 
> container_1506600355508_0023_01_000004
> [2017-09-29T15:47:15.220+08:00] [INFO] 
> containermanager.container.ContainerImpl.handle(ContainerImpl.java 1142) 
> [AsyncDispatcher event handler] : Container 
> container_1506600355508_0023_01_000004 transitioned from RUNNING to 
> EXITED_WITH_FAILURE
> [2017-09-29T15:47:15.221+08:00] [INFO] 
> containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java
>  440) [AsyncDispatcher event handler] : Cleaning up container 
> container_1506600355508_0023_01_000004
> {code}
> I guess the proccess is done, but 2 seconde later( the variable is msecLeft), 
> the *.pid.exitcode wasn't created. Then I changed variable to 20000ms, The 
> container is succeed when nodemanger is restart.
> So I think it is too short for docker container to complete the work.
> In docker mode of LinuxContainer, nm monitor the real task which is launched 
> by "docker run" command. Then "docker wait" command will wait for exitcode, 
> then "docker rm" will delete the docker container. Lastly, container-executor 
> will write the exit code. So if some docker command is slow enough, nm 
> wouldn't monitor the container. In fact, docker rm is always slow. 
> I think the exit code of docker rm dosen't matter with the real task, so I 
> think we could move the operation of write "*.pid.exitcode" before the 
> command of docker rm. Or monitor the docker wait proccess, but not the real 
> task.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to