[jira] [Commented] (YARN-7278) LinuxContainer in docker mode will be failed when nodemanager restart, because timeout for docker is too slow.

Shane Kumpf (JIRA) Thu, 29 Mar 2018 05:56:21 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-7278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418944#comment-16418944
 ]


Shane Kumpf commented on YARN-7278:
-----------------------------------

I believe this is now resolved with the changes added by YARN-5366. We no 
longer call {{docker rm}} prior to writing the exit code and no longer depend 
on {{docker wait}}. Closing this for now, but please reopen if you see this 
after applying that patch.
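
As a rough illustration of the ordering described above (publish the exit code first, remove the container afterwards), here is a minimal Java sketch. It is not taken from the YARN-5366 patch: the class name, method names, and the {{slowCleanup}} callback are hypothetical, and the temp-file-plus-rename step is an assumption added so that a reader never sees a half-written file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, for illustration only; not part of Hadoop.
class ExitCodeFirst {

    /** Writes the exit code to a temp file, then renames it into place. */
    static void writeExitCode(Path exitCodeFile, int exitCode) throws IOException {
        Path tmp = exitCodeFile.resolveSibling(exitCodeFile.getFileName() + ".tmp");
        Files.write(tmp, Integer.toString(exitCode).getBytes(StandardCharsets.UTF_8));
        // Rename so a polling reader sees either no file or a complete one.
        Files.move(tmp, exitCodeFile, StandardCopyOption.ATOMIC_MOVE);
    }

    /** Publishes the exit code before running the (possibly slow) cleanup. */
    static void finishContainer(Path exitCodeFile, int exitCode, Runnable slowCleanup)
            throws IOException {
        writeExitCode(exitCodeFile, exitCode);
        // Stand-in for removing the Docker container; its duration no longer
        // delays the moment the exit code becomes visible.
        slowCleanup.run();
    }
}
```

With this ordering, anything waiting on the exit code file is unaffected by how long the cleanup step takes.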

> LinuxContainer in docker mode will be failed when nodemanager restart, 
> because timeout for docker is too slow.
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-7278
>                 URL: https://issues.apache.org/jira/browse/YARN-7278
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.0
>         Environment: CentOS
>            Reporter: zhengchenyu
>            Priority: Major
>             Fix For: 2.9.1
>
>   Original Estimate: 1m
>  Remaining Estimate: 1m
>
> In our cluster, NodeManager recovery is turned on, and we use LinuxContainer 
> in Docker mode.
> Containers may fail when the NodeManager restarts. The exception is below:
> {code}
> [2017-09-29T15:47:14.433+08:00] [INFO] 
> containermanager.monitor.ContainersMonitorImpl.run(ContainersMonitorImpl.java 
> 472) [Container Monitor] : Memory usage of ProcessTree 120523 for 
> container-id container_1506600355508_0023_01_000004: -1B of 10 GB physical 
> memory used; -1B of 31 GB virtual memory used
> [2017-09-29T15:47:15.219+08:00] [ERROR] 
> containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java
>  93) [ContainersLauncher #1] : Unable to recover container 
> container_1506600355508_0023_01_000004
> java.io.IOException: Timeout while waiting for exit code from 
> container_1506600355508_0023_01_000004
> [2017-09-29T15:47:15.220+08:00] [INFO] 
> containermanager.container.ContainerImpl.handle(ContainerImpl.java 1142) 
> [AsyncDispatcher event handler] : Container 
> container_1506600355508_0023_01_000004 transitioned from RUNNING to 
> EXITED_WITH_FAILURE
> [2017-09-29T15:47:15.221+08:00] [INFO] 
> containermanager.launcher.ContainerLaunch.cleanupContainer(ContainerLaunch.java
>  440) [AsyncDispatcher event handler] : Cleaning up container 
> container_1506600355508_0023_01_000004
> {code}
> I guess the process is done, but 2 seconds later (the variable is msecLeft), 
> the *.pid.exitcode file had not been created. I then changed the variable to 
> 20000 ms, and the container recovered successfully when the NodeManager 
> restarted.
> So I think the timeout is too short for the Docker container to complete its 
> work.
> In the Docker mode of LinuxContainer, the NM monitors the real task, which is 
> launched by the "docker run" command. The "docker wait" command then waits 
> for the exit code, and "docker rm" deletes the Docker container. Lastly, 
> container-executor writes the exit code. So if any Docker command is slow 
> enough, the NM gives up waiting for the exit code and stops monitoring the 
> container. In fact, docker rm is always slow. 
> I think the exit code of docker rm has nothing to do with the real task, so 
> we could write "*.pid.exitcode" before running docker rm. Alternatively, the 
> NM could monitor the docker wait process rather than the real task.
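
The wait described in the report (poll for the *.pid.exitcode file, give up once the time budget is spent) can be sketched roughly as follows. This is a minimal illustration and not the NodeManager's actual code: the class name, method name, and parameters are hypothetical, and only the msecLeft variable name and the error message text come from the report above. The timeout is a parameter here so that the 2000 ms and 20000 ms values mentioned in the report can both be tried.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only; not part of Hadoop.
class ExitCodeWaiter {

    /**
     * Polls until exitCodeFile exists or timeoutMs has been used up,
     * then parses the file content as an integer exit code.
     */
    static int waitForExitCode(Path exitCodeFile, long timeoutMs, long pollMs)
            throws IOException, InterruptedException {
        if (pollMs <= 0) {
            throw new IllegalArgumentException("pollMs must be positive");
        }
        long msecLeft = timeoutMs;
        while (!Files.exists(exitCodeFile) && msecLeft > 0) {
            Thread.sleep(pollMs);
            msecLeft -= pollMs;
        }
        if (!Files.exists(exitCodeFile)) {
            throw new IOException(
                "Timeout while waiting for exit code from " + exitCodeFile);
        }
        // Assumes the writer publishes the file only once it is complete;
        // a half-written file would make parseInt fail.
        String text = new String(
            Files.readAllBytes(exitCodeFile), StandardCharsets.UTF_8).trim();
        return Integer.parseInt(text);
    }
}
```

A call such as waitForExitCode(path, 2000, 100) fails whenever the steps that precede the write take longer than two seconds in total, which matches the symptom in the log above.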



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

