-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53403/
-----------------------------------------------------------

(Updated Nov. 2, 2016, 1:32 p.m.)


Review request for Aurora, Joshua Cohen, Santhosh Kumar Shanmugham, and Stephan 
Erb.


Bugs: AURORA-1808
    https://issues.apache.org/jira/browse/AURORA-1808


Repository: aurora


Description (updated)
-------

This is a WIP patch showing a possible fix to AURORA-1808.

# Problem

Processes can daemonize and escape the supervision of a coordinator. With the 
Docker Containerizer, or the Mesos Containerizer with pid isolation, such 
processes are reparented to the `sh` process that launches the 
executor. For example, note `python ./daemon.py` (PID 47) sitting outside the 
runner's subtree:
````
root@aurora:/# ps xf
  PID TTY      STAT   TIME COMMAND
   48 ?        Ss     0:00 /bin/bash
   86 ?        R+     0:00  \_ ps xf
    1 ?        Ss     0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/va
    5 ?        Sl     0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vag
   23 ?        S      0:00  \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be152 --
   29 ?        Ss     0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15
   32 ?        S      0:00      |   \_ /bin/bash -c      while true; do       echo hello world       sleep 10     done
   81 ?        S      0:00      |       \_ sleep 10
   31 ?        Ss     0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-bde5cdc7-8685-46fd-9078-4a86bd5be15
   33 ?        S      0:00          \_ /bin/bash -c      while true; do       echo hello world       sleep 10     done
   82 ?        S      0:00              \_ sleep 10
   47 ?        S      0:00 python ./daemon.py
````

# Solution
Ensure that processes which escape the supervision of the coordinator are 
reparented to the runner, which can send signals to them on task teardown. We 
do this by using the `PR_SET_CHILD_SUBREAPER` option of `prctl(2)`.
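
For illustration only, here is a minimal sketch of how a Python process could mark itself as a child subreaper by calling `prctl(2)` through `ctypes`. This is not the code in the patch: the helper names `set_child_subreaper` and `is_child_subreaper` are hypothetical, and the sketch assumes Linux 3.4 or later. The option numbers 36 and 37 come from `<linux/prctl.h>`.

```python
import ctypes
import os
import sys

# Option numbers from <linux/prctl.h>; available on Linux 3.4 and later.
PR_SET_CHILD_SUBREAPER = 36
PR_GET_CHILD_SUBREAPER = 37


def _prctl(option, arg2):
    """Call prctl(2) through libc, raising OSError on failure."""
    # CDLL(None) resolves symbols from the running process, which includes libc.
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    if libc.prctl(ctypes.c_int(option), arg2, zero, zero, zero) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def set_child_subreaper(enable=True):
    """Hypothetical helper: make this process adopt orphaned descendants.

    Returns False without doing anything on non-Linux platforms.
    """
    if not sys.platform.startswith("linux"):
        return False
    _prctl(PR_SET_CHILD_SUBREAPER, ctypes.c_ulong(1 if enable else 0))
    return True


def is_child_subreaper():
    """Hypothetical helper: read the flag back with PR_GET_CHILD_SUBREAPER."""
    flag = ctypes.c_int(0)
    _prctl(PR_GET_CHILD_SUBREAPER, ctypes.byref(flag))
    return bool(flag.value)
```

Once the flag is set, a descendant that daemonizes is reparented to the nearest subreaper ancestor rather than to PID 1 of the pid namespace.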

After this change the process tree looks like:
````
root@aurora:/# ps xf
  PID TTY      STAT   TIME COMMAND
   66 ?        Ss     0:00 /bin/bash
   70 ?        R+     0:00  \_ ps xf
    1 ?        Ss     0:00 /bin/sh -c ${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/va
    5 ?        Sl     0:02 python2.7 /mnt/mesos/sandbox/thermos_executor.pex --announcer-ensemble localhost:2181 --announcer-zookeeper-auth-config /home/vagrant/aurora/examples/vag
   23 ?        S      0:00  \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b849 --
   33 ?        Ss     0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84
   40 ?        S      0:00      |   \_ /bin/bash -c      while true; do       echo hello world       sleep 10     done
   63 ?        S      0:00      |       \_ sleep 10
   36 ?        Ss     0:00      \_ /usr/local/bin/python2.7 /mnt/mesos/sandbox/thermos_runner.pex --task_id=www-data-devel-hello_docker_engine-0-721406db-00f5-4c0c-915e-1dbc5568b84
   37 ?        S      0:00      |   \_ /bin/bash -c      while true; do       echo hello world       sleep 10     done
   62 ?        S      0:00      |       \_ sleep 10
   55 ?        S      0:00      \_ python ./daemon.py
````

Now the runner is aware of the reparented processes and can tear them down 
cleanly during task teardown.

Note that the man page for `prctl(2)` says that a process that sets 
`PR_SET_CHILD_SUBREAPER` should reap its children to get rid of zombies. The 
runner already does this in its run loop via 
`TaskRunnerHelper.reap_children()`. This patch has the side effect of ensuring 
that it reaps all of the children launched via coordinators.
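
As a rough illustration of what reaping in a run loop involves, the sketch below collects every child that has already exited without blocking. It is an assumption-laden stand-in, not the body of `TaskRunnerHelper.reap_children()`: the function name `reap_exited_children` is hypothetical, and it relies only on the standard `os.waitpid` call with `os.WNOHANG`.

```python
import errno
import os


def reap_exited_children():
    """Hypothetical helper: reap all children that have already exited.

    Returns a list of (pid, status) tuples. Never blocks: WNOHANG makes
    waitpid return (0, 0) when children exist but none has exited yet.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except OSError as e:
            if e.errno == errno.ECHILD:
                # This process has no children at all.
                break
            raise
        if pid == 0:
            # Children are still running; nothing more to collect right now.
            break
        reaped.append((pid, status))
    return reaped
```

A subreaper that calls something like this on each iteration of its loop also collects adopted descendants, so daemonized processes do not linger as zombies after they exit.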


Diffs
-----

  src/main/python/apache/thermos/common/process_util.py abd2c0ef35858d13971319b0a7436ce2293824ce 
  src/main/python/apache/thermos/core/helper.py 68855e1e54ba1cd4456e18a36fb237ce6a468c34 
  src/main/python/apache/thermos/core/process.py 3ec43e2719ef97026f399c4b2aa23002559b3153 
  src/main/python/apache/thermos/core/runner.py 7b9013d11f6ff4172b6b7bf56e62299b0d11c977 

Diff: https://reviews.apache.org/r/53403/diff/


Testing
-------

No automated tests yet.

Validated behaviour with `ps` and `strace`.


Thanks,

Zameer Manji
