Andrei Budnik created MESOS-9230:
------------------------------------

Summary: Docker executor may get stuck in an infinite loop when `docker run` hangs.
Key: MESOS-9230
URL: https://issues.apache.org/jira/browse/MESOS-9230
Project: Mesos
Issue Type: Bug
Components: docker, executor
Affects Versions: 1.6.0, 1.5.1, 1.4.2, 1.2.3
Reporter: Andrei Budnik

This issue is caused by a very slow or unresponsive Docker daemon. Observed behaviour of the Docker executor:

# The agent launches the Docker executor, which calls `docker run` to launch a container.
# `docker inspect` hangs each time it is called, so the Docker executor [retries in a loop|https://github.com/apache/mesos/blob/master/src/docker/executor.cpp#L244-L275] without success.
# After 5 minutes, the framework (Marathon) sends the first `killTask` message, which [interrupts|https://github.com/apache/mesos/blob/master/src/docker/executor.cpp#L543-L550] the ongoing `docker inspect` loop.
# `killTask()` then launches the very first `docker stop`, which hangs.
# The framework sends a second `killTask()` after 20 seconds, which [interrupts|https://github.com/apache/mesos/blob/master/src/docker/executor.cpp#L599-L607] the first `docker stop` command.
# The framework continues to send `killTask()` every 20 seconds, but `docker stop` now always returns immediately with an error: "Error response from daemon: No such container: mesos-some-UID".

Since `docker run` [hangs|https://github.com/apache/mesos/blob/master/src/docker/executor.cpp#L242], the `reaped()` [callback|https://github.com/apache/mesos/blob/master/src/docker/executor.cpp#L664-L693] is never called. Thus, the Docker executor gets stuck in an infinite `docker stop` loop.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
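
The kill sequence reported in this issue can be illustrated with a small toy model. This is only a sketch, not Mesos code: `ToyExecutor`, `kill_task`, `simulate` and the log strings are hypothetical names chosen for illustration, and the model assumes that only `reaped()` can mark the task as terminated and that `reaped()` fires only when `docker run` exits.

```python
from dataclasses import dataclass, field
from typing import List

# Error text quoted from the report.
NO_SUCH_CONTAINER = "Error response from daemon: No such container: mesos-some-UID"


@dataclass
class ToyExecutor:
    """Toy stand-in for the Docker executor; every name here is hypothetical."""
    terminated: bool = False  # assumed to be set only by reaped()
    stop_calls: int = 0
    log: List[str] = field(default_factory=list)

    def reaped(self) -> None:
        # Per the report, this callback runs only when `docker run` exits.
        self.terminated = True

    def kill_task(self) -> None:
        self.stop_calls += 1
        if self.stop_calls == 1:
            # The first `docker stop` hangs until the next killTask interrupts it.
            self.log.append("docker stop: hangs")
        else:
            # Every later `docker stop` fails immediately.
            self.log.append("docker stop: " + NO_SUCH_CONTAINER)
        # Nothing on this path calls reaped(), so `terminated` stays False.


def simulate(kill_messages: int) -> ToyExecutor:
    """Deliver `kill_messages` killTask messages while `docker run` hangs."""
    executor = ToyExecutor()
    for _ in range(kill_messages):
        if executor.terminated:
            break  # never reached in this scenario
        executor.kill_task()
    return executor


if __name__ == "__main__":
    result = simulate(5)
    print(result.terminated, result.stop_calls)
```

However many `killTask` messages are delivered in this model, `terminated` never becomes true, because the only transition that sets it depends on `docker run` returning.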