Hey,
Recently I have experienced a number of issues in a production environment
with the DockerContainerizer, Aurora and Thermos. Although my experience is
specific to Docker, I believe this applies to anyone using the Mesos
Containerizer with pid isolation. The root cause of these issues originate
to the interactions between how we launch the executor, and the role of PID
1.
The CommandInfo for the ExecutorInfo uses the default `shell` value which
is `true`[1]. This means that in any PID isolated container the `sh`
process that launches the executor will become PID 1. Here is an example
`ps` output from vagrant showing this:
````
root@aurora:/# ps auxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 250 0.0 0.0 21928 2124 ? Ss 01:19 0:00 /bin/bash
root 469 0.0 0.0 19176 1240 ? R+ 01:28 0:00 \_ ps auxf
root 1 0.0 0.0 4328 636 ? Ss 01:10 0:00 /bin/sh -c
${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181
--announcer-zookeeper-auth-config /home/vagrant/aurora/examples/
vagrant/config/announcer-auth.json --mesos-containerizer
root 5 0.7 1.4 1201128 45604 ? Sl 01:10 0:08 python2.7
/mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181
--announcer-zookeeper-auth-config /home/vagrant/aurora/examples/
vagrant/config/announcer-auth.json --mesos-containerizer-
root 23 0.1 0.6 115668 20764 ? S 01:10 0:01 \_
/usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487
--log_to_disk=DEBUG --hostname=192.168.33.7 --thermos_js
root 29 0.0 0.5 113476 17936 ? Ss 01:10 0:00 \_
/usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487
--log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
root 34 0.0 0.0 20040 1476 ? S 01:10 0:00 |
\_ /bin/bash -c while true; do echo hello world sleep 10
done
root 468 0.0 0.0 4228 348 ? S 01:28 0:00 |
\_ sleep 10
root 31 0.0 0.5 113476 17936 ? Ss 01:10 0:00 \_
/usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex
--task_id=www-data-devel-hello_docker_engine-0-5f443832-a13e-4cde-97e3-89aa905f2487
--log_to_disk=DEBUG --hostname=192.168.33.7 --thermo
root 32 0.0 0.0 20040 1476 ? S 01:10 0:00
\_ /bin/bash -c while true; do echo hello world sleep 10
done
root 467 0.0 0.0 4228 352 ? S 01:28 0:00
\_ sleep 10
root 47 0.0 0.0 24116 3052 ? S 01:10 0:00 python
./daemon.py
````
This means processes that double fork/daemonize will be re parented to `sh`
and not our executor. You can see that the `python daemon.py` process has
been reparented to `sh` and not the executor and is outside of the scope of
the runners. This has a number of undesirable implications, perhaps most
concerning is that processes that end up reparenting to PID 1 will not
receive SIGTERM or SIGKILL from thermos but instead will be killed by the
kernel when thermos decides to to exit. If anyone here decides to run
published images that use popular software that double forks (like nginx),
you will never be able to ensure the processes die cleanly.
I've been thinking about this problem for a while and upon advice from
others and my own research I believe the best solution is as follows:
1. We have good reasons for setting `shell=True` when launching the
executor. I'm not comfortable changing this because I'm not sure of all of
the implications if we choose another method.
2. The thermos runners end up forking off the target processes. I think the
runners should be responsible for all of the processes that are created by
the children.
3. We can make the runners responsible for their grand children by using
`prctl(2)`[2] and setting the `PR_SET_CHILD_SUBREAPER` bit for each runner.
This means double forked processes will be reparented to the runner and not
PID 1
4. On task tear down, we make the runners send SIGTERM and SIGKILL to the
PIDs they recorded and any other children they have.
5. Each runner would need to have a SIGCHLD handler to handle zombie
processes that are reparented to it.
[1]: https://github.com/apache/aurora/blob/783baaefb9a814ca01fad78181fe3d
f3de5b34af/src/main/java/org/apache/aurora/scheduler/configuration/executor/
ExecutorModule.java#L109-L135
[2]: http://man7.org/linux/man-pages/man2/prctl.2.html
--
Zameer Manji