[ 
https://issues.apache.org/jira/browse/MESOS-9230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613833#comment-16613833
 ] 

Andrei Budnik commented on MESOS-9230:
--------------------------------------

*Important note:*
We cannot terminate the Docker executor until `docker run` finishes. 
Otherwise, the Docker daemon might launch the task right after the Docker 
executor has terminated, and we could end up with many orphaned Docker 
containers burdening the system.

In this particular case, when `docker stop` returns the error "No such 
container", it may mean that the Docker daemon has not started the container 
yet. So, we need to wait for `docker run` to finish.
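
The rule above can be sketched as a small decision function. This is only an 
illustration in Python, not the executor's actual C++ code: `KillAction`, 
`next_kill_action` and their parameters are hypothetical names, and matching 
on the "No such container" substring is an assumption based on the error text 
quoted in the issue description.

```python
from enum import Enum


class KillAction(Enum):
    """Hypothetical actions the executor could take while handling a kill."""
    TERMINATE = "terminate"
    WAIT_FOR_DOCKER_RUN = "wait_for_docker_run"
    RETRY_DOCKER_STOP = "retry_docker_stop"


def next_kill_action(run_finished: bool,
                     stop_exit_code: int,
                     stop_stderr: str) -> KillAction:
    """Decide what to do after a `docker stop` attempt.

    Assumption: `run_finished` is True once `docker run` has exited,
    i.e. once the executor would learn that the container is gone.
    """
    # Terminating is only safe once `docker run` has finished.
    if run_finished:
        return KillAction.TERMINATE

    # "No such container" may mean the daemon has not started the
    # container yet, so keep waiting for `docker run` rather than exiting.
    if stop_exit_code != 0 and "No such container" in stop_stderr:
        return KillAction.WAIT_FOR_DOCKER_RUN

    # Any other failure of `docker stop` is worth retrying.
    if stop_exit_code != 0:
        return KillAction.RETRY_DOCKER_STOP

    # `docker stop` succeeded; `docker run` is expected to exit soon.
    return KillAction.WAIT_FOR_DOCKER_RUN
```

In this sketch the executor never terminates while `docker run` is still 
pending, which is what prevents a late-started container from being orphaned.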

> Docker executor may stuck in infinite loop when `docker run` hangs.
> -------------------------------------------------------------------
>
>                 Key: MESOS-9230
>                 URL: https://issues.apache.org/jira/browse/MESOS-9230
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker, executor
>    Affects Versions: 1.2.3, 1.4.2, 1.5.1, 1.6.0
>            Reporter: Andrei Budnik
>            Priority: Major
>
> This issue happens due to a very slow/unresponsive Docker daemon.
> Observed behaviour of the Docker executor:
>  # Agent launches the Docker executor, which calls `docker run` to launch a 
> container.
>  # `docker inspect` hangs each time it's called, so the docker executor 
> [retries in a 
> loop|https://github.com/apache/mesos/blob/master/src/docker/executor.cpp#L244-L275]
>  without success.
>  # After 5 minutes, a framework (Marathon) sends the first `killTask` message, 
> which 
> [interrupts|https://github.com/apache/mesos/blob/master/src/docker/executor.cpp#L543-L550]
>  the previous `docker inspect` loop.
>  # Then, `killTask()` launches the very first `docker stop`, which hangs.
>  # The framework sends the second `killTask()` after 20 seconds, which 
> [interrupts|https://github.com/apache/mesos/blob/master/src/docker/executor.cpp#L599-L607]
>  the first `docker stop` command.
>  # The framework continues to send `killTask()` every 20 seconds, but `docker 
> stop` always immediately returns an error: "Error response from daemon: No 
> such container: mesos-some-UID".
> Since `docker run` 
> [hangs|https://github.com/apache/mesos/blob/master/src/docker/executor.cpp#L242],
>  the `reaped()` 
> [callback|https://github.com/apache/mesos/blob/master/src/docker/executor.cpp#L664-L693]
>  is never called. Thus, the Docker executor gets stuck in an infinite `docker 
> stop` loop.


