Resending this from my @apache.org address in case my previous email got caught in spam.

On Mon, Oct 31, 2016 at 6:42 PM, Zameer Manji <zma...@uber.com> wrote:

> Hey,
>
> Recently I have experienced a number of issues in a production environment
> with the DockerContainerizer, Aurora and Thermos. Although my experience is
> specific to Docker, I believe this applies to anyone using the Mesos
> Containerizer with PID isolation. The root cause of these issues is the
> interaction between how we launch the executor and the role of PID 1.
>
> The CommandInfo for the ExecutorInfo uses the default `shell` value, which
> is `true`[1]. This means that in any PID-isolated container the `sh`
> process that launches the executor becomes PID 1. Here is an example
> `ps` output from vagrant showing this:
> ````
> root@aurora:/# ps auxf
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 250 0.0 0.0 21928 2124 ? Ss 01:19 0:00 /bin/bash
> root 469 0.0 0.0 19176 1240 ? R+ 01:28 0:00 \_ ps auxf
> root 1 0.0 0.0 4328 636 ? Ss 01:10 0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json --mesos-containerizer
> root 5 0.7 1.4 1201128 45604 ? Sl 01:10 0:08 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json --mesos-containerizer-
> root 23 0.1 0.6 115668 20764 ? S 01:10 0:01 \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermos_js
> root 29 0.0 0.5 113476 17936 ? Ss 01:10 0:00 \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
> root 34 0.0 0.0 20040 1476 ? S 01:10 0:00 | \_ /bin/bash -c while true; do echo hello world sleep 10 done
> root 468 0.0 0.0 4228 348 ? S 01:28 0:00 | \_ sleep 10
> root 31 0.0 0.5 113476 17936 ? Ss 01:10 0:00 \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487 --log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
> root 32 0.0 0.0 20040 1476 ? S 01:10 0:00 \_ /bin/bash -c while true; do echo hello world sleep 10 done
> root 467 0.0 0.0 4228 352 ? S 01:28 0:00 \_ sleep 10
> root 47 0.0 0.0 24116 3052 ? S 01:10 0:00 python ./daemon.py
> ````
>
> This means processes that double fork/daemonize are reparented to `sh`
> and not to our executor. You can see that the `python daemon.py` process
> has been reparented to `sh` rather than the executor and is outside the
> scope of the runners. This has a number of undesirable implications.
> Perhaps the most concerning is that processes reparented to PID 1 will
> not receive SIGTERM or SIGKILL from thermos; instead they are killed by
> the kernel when thermos decides to exit. If anyone here runs published
> images that use popular software that double forks (like nginx), they
> will never be able to ensure the processes die cleanly.
>
> I've been thinking about this problem for a while, and based on advice
> from others and my own research I believe the best solution is as follows:
> 1. We have good reasons for setting `shell=True` when launching the
>    executor. I'm not comfortable changing this because I'm not sure of
>    all the implications if we choose another method.
> 2. The thermos runners end up forking off the target processes. I think
>    the runners should be responsible for all of the processes created by
>    their children.
> 3. We can make the runners responsible for their grandchildren by using
>    `prctl(2)`[2] and setting the `PR_SET_CHILD_SUBREAPER` attribute for
>    each runner. This means double-forked processes will be reparented to
>    the runner and not to PID 1.
> 4. On task teardown, we make the runners send SIGTERM and SIGKILL to the
>    PIDs they recorded and any other children they have.
> 5. Each runner would need a SIGCHLD handler to reap zombie processes
>    that are reparented to it.
>
> [1]: https://github.com/apache/aurora/blob/783baaefb9a814ca01fad78181fe3df3de5b34af/src/main/java/org/apache/aurora/scheduler/configuration/executor/ExecutorModule.java#L109-L135
> [2]: http://man7.org/linux/man-pages/man2/prctl.2.html
>
> --
> Zameer Manji
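
To make points 3 and 5 of the proposal concrete, here is a minimal Python sketch of how a process could mark itself as a child subreaper and reap the processes that get reparented to it. It is not Thermos code: the helper names (`become_subreaper`, `is_subreaper`, `double_fork_report_parent`, `reap_children`, `install_sigchld_reaper`) are hypothetical, the constants 36 and 37 are the `PR_SET_CHILD_SUBREAPER` and `PR_GET_CHILD_SUBREAPER` values from `<linux/prctl.h>`, and it assumes Linux 3.4 or newer with `prctl` reachable through `ctypes`.

```python
import ctypes
import os
import signal
import time

# Values from <linux/prctl.h>; the option exists on Linux >= 3.4.
PR_SET_CHILD_SUBREAPER = 36
PR_GET_CHILD_SUBREAPER = 37

_libc = ctypes.CDLL(None, use_errno=True)


def _prctl(option, arg2):
    """Call prctl(2) and raise OSError on failure."""
    zero = ctypes.c_ulong(0)
    if _libc.prctl(ctypes.c_int(option), arg2, zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def become_subreaper():
    """Ask the kernel to reparent orphaned descendants to this process."""
    _prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1))


def is_subreaper():
    """Return True if the calling process has the subreaper attribute set."""
    flag = ctypes.c_int(0)
    _prctl(PR_GET_CHILD_SUBREAPER, ctypes.byref(flag))
    return bool(flag.value)


def double_fork_report_parent():
    """Double fork like a daemon and return the grandchild's final parent PID."""
    read_fd, write_fd = os.pipe()
    middle = os.fork()
    if middle == 0:
        try:
            os.close(read_fd)
            middle_pid = os.getpid()
            if os.fork() == 0:
                # Grandchild: wait until the intermediate process has exited
                # and the kernel has picked a new parent for us.
                while os.getppid() == middle_pid:
                    time.sleep(0.01)
                os.write(write_fd, str(os.getppid()).encode())
        finally:
            # Both the intermediate process and the grandchild exit here.
            os._exit(0)
    os.close(write_fd)
    os.waitpid(middle, 0)  # reap the intermediate process
    data = os.read(read_fd, 32)
    os.close(read_fd)
    return int(data)


def reap_children():
    """Collect every exited child without blocking; return the reaped PIDs."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def install_sigchld_reaper():
    """Reap reparented zombies whenever SIGCHLD is delivered."""
    signal.signal(signal.SIGCHLD, lambda signum, frame: reap_children())
```

After `become_subreaper()`, `double_fork_report_parent()` should return the caller's own PID instead of 1, which is the behaviour point 3 relies on. A real runner would also have to record the reparented PIDs so that the teardown in point 4 can signal them, which this sketch leaves out.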