Resending this from my @apache.org email in case my previous email got
caught in spam.

On Mon, Oct 31, 2016 at 6:42 PM, Zameer Manji <zma...@uber.com> wrote:

> Hey,
>
> Recently I have experienced a number of issues in a production environment
> with the DockerContainerizer, Aurora and Thermos. Although my experience is
> specific to Docker, I believe this applies to anyone using the Mesos
> Containerizer with PID isolation. The root cause of these issues lies in
> the interaction between how we launch the executor and the role of PID 1.
>
> The CommandInfo for the ExecutorInfo uses the default `shell` value which
> is `true`[1]. This means that in any PID isolated container the `sh`
> process that launches the executor will become PID 1. Here is an example
> `ps` output from vagrant showing this:
> ````
> root@aurora:/# ps auxf
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root       250  0.0  0.0  21928  2124 ?        Ss   01:19   0:00 /bin/bash
> root       469  0.0  0.0  19176  1240 ?        R+   01:28   0:00  \_ ps
> auxf
> root         1  0.0  0.0   4328   636 ?        Ss   01:10   0:00 /bin/sh
> -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble
> localhost:2181 --announcer-zookeeper-auth-config
> /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json
> --mesos-containerizer
> root         5  0.7  1.4 1201128 45604 ?       Sl   01:10   0:08 python2.7
> /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble
> localhost:2181 --announcer-zookeeper-auth-config
> /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json
> --mesos-containerizer-
> root        23  0.1  0.6 115668 20764 ?        S    01:10   0:01  \_
> /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
> --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487
> --log_to_disk=DEBUG --hostname=192.168.33.7 --thermos_js
> root        29  0.0  0.5 113476 17936 ?        Ss   01:10   0:00      \_
> /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
> --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487
> --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
> root        34  0.0  0.0  20040  1476 ?        S    01:10   0:00      |
> \_ /bin/bash -c      while true; do       echo hello world       sleep 10
>   done
> root       468  0.0  0.0   4228   348 ?        S    01:28   0:00      |
>     \_ sleep 10
> root        31  0.0  0.5 113476 17936 ?        Ss   01:10   0:00      \_
> /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
> --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487
> --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
> root        32  0.0  0.0  20040  1476 ?        S    01:10   0:00
>  \_ /bin/bash -c      while true; do       echo hello world       sleep 10
>     done
> root       467  0.0  0.0   4228   352 ?        S    01:28   0:00
>    \_ sleep 10
> root        47  0.0  0.0  24116  3052 ?        S    01:10   0:00 python
> ./daemon.py
> ````
>
> This means processes that double fork/daemonize will be reparented to
> `sh` and not our executor. You can see that the `python daemon.py` process
> has been reparented to `sh` and not the executor and is outside of the
> scope of the runners. This has a number of undesirable implications.
> Perhaps most concerning is that processes that end up reparented to PID 1
> will not receive SIGTERM or SIGKILL from thermos but instead will be killed
> by the kernel when thermos decides to exit. Anyone here who runs published
> images that use popular software that double forks (like nginx) will never
> be able to ensure the processes die cleanly.
>
> I've been thinking about this problem for a while and upon advice from
> others and my own research I believe the best solution is as follows:
> 1. We have good reasons for setting `shell=True` when launching the
> executor. I'm not comfortable changing this because I'm not sure of all of
> the implications if we choose another method.
> 2. The thermos runners end up forking off the target processes. I think
> the runners should be responsible for all of the processes that are created
> by the children.
> 3. We can make the runners responsible for their grandchildren by using
> `prctl(2)`[2] to set the `PR_SET_CHILD_SUBREAPER` attribute on each
> runner. This means double forked processes will be reparented to the
> runner and not PID 1.
> 4. On task teardown, we make the runners send SIGTERM and SIGKILL to the
> PIDs they recorded and any other children they have.
> 5. Each runner would need to have a SIGCHLD handler to handle zombie
> processes that are reparented to it.
>
> [1]: https://github.com/apache/aurora/blob/783baaefb9a814ca01fad78181fe3df3de5b34af/src/main/java/org/apache/aurora/scheduler/configuration/executor/ExecutorModule.java#L109-L135
> [2]: http://man7.org/linux/man-pages/man2/prctl.2.html
>
> --
> Zameer Manji
>
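The subreaper proposal in points 3 to 5 of the quoted message can be illustrated with a small, Linux-only Python 3 sketch. It is not Thermos code: the helper names (`become_subreaper`, `reap_children`, `spawn_daemon`, `terminate_children`) and the 2 second grace period are assumptions made for illustration, and the value 36 for `PR_SET_CHILD_SUBREAPER` is taken from `<linux/prctl.h>` (Linux 3.4 or later). To stay short, the sketch polls `reap_children()` instead of installing it as a SIGCHLD handler, and it only signals the PIDs it is given; it does not discover other children.

```python
import ctypes
import os
import signal
import time

PR_SET_CHILD_SUBREAPER = 36  # from <linux/prctl.h>; needs Linux >= 3.4


def become_subreaper():
    """Mark this process as a subreaper so orphaned descendants reparent to it."""
    libc = ctypes.CDLL(None, use_errno=True)
    rc = libc.prctl(ctypes.c_int(PR_SET_CHILD_SUBREAPER), ctypes.c_ulong(1),
                    ctypes.c_ulong(0), ctypes.c_ulong(0), ctypes.c_ulong(0))
    if rc != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def reap_children():
    """Collect every exited child without blocking; return the reaped PIDs.

    This is the body a SIGCHLD handler could call (point 5).
    """
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def spawn_daemon(seconds):
    """Double-fork a sleeping process (hypothetical stand-in for a daemon).

    Returns the grandchild PID once the intermediate child has exited,
    i.e. once the kernel has reparented the grandchild.
    """
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:  # intermediate child
        os.close(r)
        grandchild = os.fork()
        if grandchild == 0:  # the "daemon"
            os.close(w)
            time.sleep(seconds)
            os._exit(0)
        os.write(w, str(grandchild).encode())
        os._exit(0)
    os.close(w)
    data = os.read(r, 32)
    os.close(r)
    os.waitpid(pid, 0)  # reap the intermediate child
    return int(data)


def _signal(pid, sig):
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        pass  # already gone


def terminate_children(pids, grace=2.0):
    """SIGTERM the given PIDs, reap for `grace` seconds, SIGKILL survivors.

    Returns the set of PIDs that needed SIGKILL (point 4).
    """
    remaining = set(pids)
    for pid in remaining:
        _signal(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace
    while remaining and time.monotonic() < deadline:
        remaining.difference_update(reap_children())
        if remaining:
            time.sleep(0.05)
    for pid in remaining:
        _signal(pid, signal.SIGKILL)
        try:
            os.waitpid(pid, 0)
        except ChildProcessError:
            pass  # not our child, nothing to reap
    return remaining


if __name__ == "__main__":
    become_subreaper()
    daemon_pid = spawn_daemon(60)
    # With the subreaper attribute set, the daemon is now our child and
    # can be signalled and reaped like any directly forked process.
    terminate_children([daemon_pid])
```

Without the `become_subreaper()` call, the process created by `spawn_daemon` would be reparented to PID 1 of the container (the `sh` wrapper in the `ps` output above), and the `os.waitpid` calls here could not see it.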
