-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/
-----------------------------------------------------------

(Updated March 2, 2018, 3:15 p.m.)


Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and Vinod Kone.


Bugs: MESOS-8574
    https://issues.apache.org/jira/browse/MESOS-8574


Repository: mesos


Description
-------

Previously, if the `docker inspect` command hung, the docker container ended up in an unkillable state. This patch adds a timeout for the `inspect` command after receiving `killTask`, analogously to the `reaped` handler. In addition, we've added a timeout for the `docker stop` command. If the docker `stop` or `inspect` command times out, we discard the related future, so the docker library kills the previously spawned docker cli subprocess. As a result, a scheduler can retry the `killTask` operation to handle nasty docker bugs that lead to a hanging docker cli.


Diffs (updated)
-----

  src/docker/executor.cpp 93c3e1d1e86814e34cbe5b045f6e61911266c535


Diff: https://reviews.apache.org/r/65713/diff/6/

Changes: https://reviews.apache.org/r/65713/diff/5-6/


Testing
-------

internal CI

Manual testing:

1. Build docker from sources: http://oyvindsk.com/writing/docker-build-from-source

2. Modify the `ContainerInspect` function in `docker/inspect.go`:
```
 func (daemon *Daemon) ContainerInspect(name string, size bool, version string) (interface{}, error) {
+    time.Sleep(10 * time.Second)
```

3. Modify the `ContainerStop` function in `docker/stop.go`:
```
 func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
+    rand.Seed(time.Now().UTC().UnixNano())
+    if rand.Intn(2) == 0 {
+        time.Sleep(20 * time.Second)
+    }
```

4. Rebuild docker: `sudo make build && sudo make binary`

5. Stop the system docker daemon: `sudo service docker stop`

6. Start the modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`

7. Modify `src/cli/execute.cpp`:
   a) Add `delay(Seconds(15), self(), &Self::retryKill, task->task_id(), offer.agent_id());` after https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
   b) Add a new method `retryKill` to `CommandScheduler`:
```
void retryKill(const TaskID& taskId, const AgentID& agentId)
{
  killTask(taskId, agentId);
  delay(Seconds(6), self(), &Self::retryKill, taskId, agentId);
}
```

8. Rebuild mesos

9. Run the mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`

10. Run the mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh --resources="cpus:10000;mem:1000000" --work_dir='/home/abudnik/mesos/build/var/agent-1' --containerizers="docker,mesos" --master="127.0.1.1:5050"`

11. Submit a task for the docker executor: `./src/mesos-execute --master="127.0.1.1:5050" --name="a" --containerizer=docker --docker_image="ubuntu:xenial" --command="sleep 9999"`


Thanks,

Andrei Budnik
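
P.S. For reviewers who want to see the timeout-and-discard idea from the Description in isolation, here is a minimal standalone sketch. It is not the code in this patch: it assumes a POSIX system and uses plain `fork`/`exec` and polling instead of libprocess futures, and the names `finishesWithin` and `retryUntilDone` are hypothetical helpers made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <thread>
#include <vector>

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper (not a Mesos API): runs `args` as a subprocess and
// returns true if it terminates on its own before `timeout` elapses.
// On timeout the child is killed and reaped, which plays the role of
// discarding the future so that the hanging cli process goes away.
bool finishesWithin(
    const std::vector<std::string>& args,
    std::chrono::milliseconds timeout)
{
  assert(!args.empty());

  // Build a NULL-terminated argv for execvp before forking.
  std::vector<char*> argv;
  for (const std::string& arg : args) {
    argv.push_back(const_cast<char*>(arg.c_str()));
  }
  argv.push_back(nullptr);

  pid_t pid = fork();
  if (pid < 0) {
    return false; // Could not spawn the subprocess.
  }

  if (pid == 0) {
    execvp(argv[0], argv.data());
    _exit(127); // Only reached if exec failed.
  }

  const auto deadline = std::chrono::steady_clock::now() + timeout;

  // Poll for termination until the deadline passes.
  while (std::chrono::steady_clock::now() < deadline) {
    int status = 0;
    pid_t result = waitpid(pid, &status, WNOHANG);
    if (result == pid) {
      return true; // Terminated in time (whatever its exit status).
    }
    if (result < 0) {
      return false; // waitpid failed.
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }

  // Timed out: kill the subprocess and reap it to avoid a zombie.
  kill(pid, SIGKILL);
  waitpid(pid, nullptr, 0);
  return false;
}

// Hypothetical retry loop, mirroring a scheduler that re-sends `killTask`
// until the underlying command finally completes in time.
bool retryUntilDone(
    const std::vector<std::string>& args,
    std::chrono::milliseconds timeout,
    int attempts)
{
  for (int i = 0; i < attempts; ++i) {
    if (finishesWithin(args, timeout)) {
      return true;
    }
  }
  return false;
}
```

In this sketch a call such as `retryUntilDone({"docker", "stop", "some-container"}, std::chrono::seconds(5), 3)` would give a randomly hanging `stop` (as in step 3 of the manual test) several chances to complete; the container name, timeout, and attempt count are placeholders, not values taken from the patch.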