-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65713/#review198436
-----------------------------------------------------------




src/docker/executor.cpp
Lines 467-468 (patched)
<https://reviews.apache.org/r/65713/#comment278582>

    In this case, should we log it?



src/docker/executor.cpp
Lines 540-544 (patched)
<https://reviews.apache.org/r/65713/#comment278584>

    Should we just call stop.discard() here and return stop?
    
    If the task was not killed by the health check, then the docker stop
    timed out and we should discard it. If it was killed by the health
    check, .discard() would make us run os::killtree on that docker stop
    subprocess and return a failure, which invokes the onFailed callback
    below and retries.
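    
    Roughly the control flow I have in mind, as a standalone sketch. This
    is not libprocess code: PendingStop, handleStopTimeout and
    retryAfterTimeout are hypothetical names, and discard() only models
    the "kill the subprocess, then fail the future" behaviour described
    above.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the pending `docker stop` future. It is NOT
// process::Future; it carries just enough state to show the flow.
struct PendingStop
{
  bool subprocessKilled = false;
  bool failed = false;
  std::vector<std::function<void(const std::string&)>> failedCallbacks;

  void onFailed(std::function<void(const std::string&)> callback)
  {
    failedCallbacks.push_back(std::move(callback));
  }

  // Assumption: discarding kills the docker cli subprocess (os::killtree
  // in the real code) and then fails the future, which runs onFailed.
  void discard()
  {
    if (failed) {
      return; // Already failed; do not run the callbacks twice.
    }

    subprocessKilled = true;
    failed = true;

    for (auto& callback : failedCallbacks) {
      callback("docker stop was discarded");
    }
  }
};

// Hypothetical timeout handler: discard the stop and return it, without
// branching on whether the task was killed by the health check.
PendingStop& handleStopTimeout(PendingStop& stop)
{
  stop.discard();
  return stop;
}

// Example use: returns true if the onFailed callback asked for a retry.
bool retryAfterTimeout()
{
  PendingStop stop;
  bool retryRequested = false;

  stop.onFailed([&retryRequested](const std::string&) {
    retryRequested = true;
  });

  handleStopTimeout(stop);
  return retryRequested;
}
```

    With this shape the timeout path is the same in both cases, and any
    retry decision stays in the onFailed callback.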


- Gilbert Song


On Feb. 27, 2018, 5:37 p.m., Andrei Budnik wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65713/
> -----------------------------------------------------------
> 
> (Updated Feb. 27, 2018, 5:37 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Gilbert Song, Greg Mann, and 
> Vinod Kone.
> 
> 
> Bugs: MESOS-8574
>     https://issues.apache.org/jira/browse/MESOS-8574
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Previously, if the `docker inspect` command hung, the docker container
> ended up in an unkillable state. This patch adds a timeout for the
> `inspect` command after receiving `killTask`, analogously to the
> `reaped` handler. In addition, we've added a timeout for the
> `docker stop` command. If the docker `stop` or `inspect` command times
> out, we discard the related future, so the docker library kills the
> previously spawned docker cli subprocess. As a result, a scheduler can
> retry the `killTask` operation to handle nasty docker bugs that lead
> to a hanging docker cli.
> 
> 
> Diffs
> -----
> 
>   src/docker/executor.cpp 93c3e1d1e86814e34cbe5b045f6e61911266c535 
> 
> 
> Diff: https://reviews.apache.org/r/65713/diff/5/
> 
> 
> Testing
> -------
> 
> internal CI
> 
> Manual testing:
> 1. Build docker from sources: 
> http://oyvindsk.com/writing/docker-build-from-source
> 2. Modify `ContainerInspect` function from `docker/inspect.go`:
> ```
>  func (daemon *Daemon) ContainerInspect(name string, size bool, version string) (interface{}, error) {
> +       time.Sleep(10 * time.Second)
> ```
> 3. Modify `ContainerStop` function from `docker/stop.go`:
> ```
>  func (daemon *Daemon) ContainerStop(name string, seconds *int) error {
> +       rand.Seed(time.Now().UTC().UnixNano())
> +       if rand.Intn(2) == 0 {
> +               time.Sleep(20 * time.Second)
> +       }
> ```
> 4. Rebuild docker: `sudo make build && sudo make binary`
> 5. Stop system docker daemon: `sudo service docker stop`
> 6. Start modified docker daemon: `sudo ./bundles/binary-daemon/dockerd-dev`
> 7. Modify `src/cli/execute.cpp`:
>   a) Add `delay(Seconds(15), self(), &Self::retryKill, task->task_id(), offer.agent_id());` after
> https://github.com/apache/mesos/blob/072ea2787ffca6f2a6dcb2d636f68c51823d6665/src/cli/execute.cpp#L606
>   b) Add a new method `retryKill` to `CommandScheduler`:
> ```
>   void retryKill(const TaskID& taskId, const AgentID& agentId)
>   {
>     killTask(taskId, agentId);
>     delay(Seconds(6), self(), &Self::retryKill, taskId, agentId);
>   }
> ```
> 8. Rebuild mesos
> 9. Run mesos master: `./bin/mesos-master.sh --work_dir='var/master-1'`
> 10. Run mesos agent: `GLOG_v=1 ./bin/mesos-agent.sh --resources="cpus:10000;mem:1000000" --work_dir='/home/abudnik/mesos/build/var/agent-1' --containerizers="docker,mesos" --master="127.0.1.1:5050"`
> 11. Submit a task for the docker executor: `./src/mesos-execute --master="127.0.1.1:5050" --name="a" --containerizer=docker --docker_image="ubuntu:xenial" --command="sleep 9999"`
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>
