-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53403/
-----------------------------------------------------------

(Updated Nov. 2, 2016, 1:32 p.m.)


Review request for Aurora, Joshua Cohen, Santhosh Kumar Shanmugham, and Stephan Erb.


Bugs: AURORA-1808
    https://issues.apache.org/jira/browse/AURORA-1808


Repository: aurora


Description (updated)
-------

This is a WIP patch showing a possible fix to AURORA-1808.

# Problem

Processes can daemonize and escape the supervision of a coordinator. Using the Docker Containerizer or the Mesos Containerizer with pid isolation means that such processes are reparented to the `sh` process that launches the executor. For example:

````
root@aurora:/# ps xf
  PID TTY      STAT   TIME COMMAND
   48 ?        Ss     0:00 /bin/bash
   86 ?        R+     0:00  \_ ps xf
    1 ?        Ss     0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/va
    5 ?        Sl     0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vag
   23 ?        S      0:00  \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be152 --
   29 ?        Ss     0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15
   32 ?        S      0:00      |   \_ /bin/bash -c while true; do echo hello world sleep 10 done
   81 ?        S      0:00      |       \_ sleep 10
   31 ?        Ss     0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15
   33 ?        S      0:00          \_ /bin/bash -c while true; do echo hello world sleep 10 done
   82 ?        S      0:00              \_ sleep 10
   47 ?        S      0:00 python ./daemon.py
````

# Solution

Ensure that processes which escape the supervision of the coordinator are reparented to the runner, which can send signals to them on task teardown. We do this by using the `PR_SET_CHILD_SUBREAPER` flag of `prctl(2)`.
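
For readers unfamiliar with the flag, here is a minimal sketch of how a Python process might mark itself as a child subreaper through `ctypes`. It is not the code in this patch: the helper names `set_child_subreaper` and `is_child_subreaper` are hypothetical, and the sketch assumes Linux 3.4 or later, where `PR_SET_CHILD_SUBREAPER` (36) and `PR_GET_CHILD_SUBREAPER` (37) are defined in `<linux/prctl.h>`.

```python
import ctypes
import ctypes.util
import sys

# Option values from <linux/prctl.h> (available since Linux 3.4).
PR_SET_CHILD_SUBREAPER = 36
PR_GET_CHILD_SUBREAPER = 37


def _libc():
    # find_library may return None; CDLL(None) then falls back to the
    # symbols already loaded into the current process, which include libc.
    return ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)


def set_child_subreaper(enabled=True):
    """Hypothetical helper: turn the subreaper flag on or off for this process.

    Returns True if prctl(2) succeeded, False otherwise (including non-Linux).
    """
    if not sys.platform.startswith("linux"):
        return False
    zero = ctypes.c_ulong(0)
    rc = _libc().prctl(PR_SET_CHILD_SUBREAPER,
                       ctypes.c_ulong(1 if enabled else 0), zero, zero, zero)
    return rc == 0


def is_child_subreaper():
    """Hypothetical helper: report whether this process is currently a subreaper."""
    if not sys.platform.startswith("linux"):
        return False
    flag = ctypes.c_int(0)
    zero = ctypes.c_ulong(0)
    rc = _libc().prctl(PR_GET_CHILD_SUBREAPER, ctypes.byref(flag), zero, zero, zero)
    return rc == 0 and flag.value == 1
```

Once the flag is set, orphaned descendants of the calling process are reparented to it rather than to PID 1 of the pid namespace.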

After this change the process tree looks like:

````
root@aurora:/# ps xf
  PID TTY      STAT   TIME COMMAND
   66 ?        Ss     0:00 /bin/bash
   70 ?        R+     0:00  \_ ps xf
    1 ?        Ss     0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/va
    5 ?        Sl     0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vag
   23 ?        S      0:00  \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b849 --
   33 ?        Ss     0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84
   40 ?        S      0:00      |   \_ /bin/bash -c while true; do echo hello world sleep 10 done
   63 ?        S      0:00      |       \_ sleep 10
   36 ?        Ss     0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84
   37 ?        S      0:00      |   \_ /bin/bash -c while true; do echo hello world sleep 10 done
   62 ?        S      0:00      |       \_ sleep 10
   55 ?        S      0:00      \_ python ./daemon.py
````

Now the runner is aware of the reparented processes and can tear them down cleanly during task teardown.

Note that the man page for `prctl(2)` says that a process that sets `PR_SET_CHILD_SUBREAPER` should reap its children to get rid of zombies. It is important to note that the runner already does this in its run loop via `TaskRunnerHelper.reap_children()`. This patch has the side effect of ensuring it will reap all of the children launched via coordinators.
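
To illustrate the reaping requirement, here is a minimal sketch of a non-blocking reap loop built on `os.waitpid`. It is not the body of `TaskRunnerHelper.reap_children()`; the function name `reap_exited_children` is hypothetical, and the sketch only shows the general pattern a subreaper could run on each pass of its loop.

```python
import errno
import os


def reap_exited_children():
    """Hypothetical helper: collect every child that has already exited.

    Never blocks. Returns a list of (pid, status) tuples, where status is
    the raw wait status as returned by os.waitpid.
    """
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG returns immediately.
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError as e:
            if e.errno == errno.ECHILD:
                # This process has no children at all.
                break
            raise
        if pid == 0:
            # Children exist, but none of them has exited yet.
            break
        reaped.append((pid, status))
    return reaped
```

Because a subreaper inherits orphaned descendants as its own children, the same `waitpid(-1, ...)` call also collects processes that were reparented to it after daemonizing.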


Diffs
-----

  src/main/python/apache/thermos/common/process_util.py abd2c0ef35858d13971319b0a7436ce2293824ce
  src/main/python/apache/thermos/core/helper.py 68855e1e54ba1cd4456e18a36fb237ce6a468c34
  src/main/python/apache/thermos/core/process.py 3ec43e2719ef97026f399c4b2aa23002559b3153
  src/main/python/apache/thermos/core/runner.py 7b9013d11f6ff4172b6b7bf56e62299b0d11c977

Diff: https://reviews.apache.org/r/53403/diff/


Testing
-------

No automated tests yet. Validated behaviour with `ps` and `strace`.


Thanks,

Zameer Manji