-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/44657/
-----------------------------------------------------------

(Updated March 23, 2016, 9:45 p.m.)


Review request for mesos, Ben Mahler and Gilbert Song.


Changes
-------

Manually tested the `killTask` path.


Bugs: MESOS-4909
    https://issues.apache.org/jira/browse/MESOS-4909


Repository: mesos


Description
-------

The command executor determines how much time it allots the
underlying task to clean up (effectively how long to wait for
the task to comply to SIGTERM before sending SIGKILL) based
on both optional task's `KillPolicy` and optional
`shutdown_grace_period` field in `ExecutorInfo`.

Manual testing was performed to ensure newly introduced protobuf
fields are respected. To do that, "mesos-execute" was modified to
support `KillPolicy` and `CommandInfo.shell=false`. To simulate a
task that does not exit in the allotted period, a tiny app
(https://github.com/rukletsov/unresponsive-process) that ignores
SIGTERM was used.


Diffs
-----

  src/launcher/executor.cpp 2df62f09637b55c63ae6fe5d5a70b8debc02fbe2 
  src/slave/slave.cpp 840534ff0687e82ed063c386e36bbabada230697 

Diff: https://reviews.apache.org/r/44657/diff/


Testing (updated)
-------

To test the newly introduced protobuf fields, I've modified "mesos-execute" to
support these fields. I've inserted these lines in `resourceOffers()` method:
```
task.mutable_kill_policy()->mutable_grace_period()->
    set_nanoseconds(Seconds(<value>).ns());
```

I've also implemented a "--no-shell" flag to support
setting `CommandInfo.shell` to `false`. An example command used in testing:
```
./src/mesos-execute --master=127.0.0.1:5050 --name=test 
--command="/Users/alex/bin/unresponsive_process" --no-shell
```

In order to test the signal escalation path, which is what the command executor
implements currently for kill policy, I've created a tiny application that
ignores SIGTERM (https://github.com/rukletsov/unresponsive-process ).

Test 1.1:
=========
`KillPolicy.grace_period` is **not set**, `shell=false`, framework shutdown.
Framework is asked to shutdown several seconds after the task is started. This
results in agent asking executor to shutdown. From the agent's logs, we see that
the command executor sends a TASK_KILLED update in approx. **3 seconds** (the
default value is **5 seconds**, but we deduct some buffer to make sure the task 
can
be reaped and TASK_KILLED is sent) after it has been asked to shut down:
```
I0323 17:24:48.704677 418881536 slave.cpp:2080] Asked to shut down framework 
9eb233b5-dfe8-4030-a989-57f31ec0e3aa-0000 by master@127.0.0.1:5050
I0323 17:24:48.704710 418881536 slave.cpp:2105] Shutting down framework 
9eb233b5-dfe8-4030-a989-57f31ec0e3aa-0000
I0323 17:24:48.704756 418881536 slave.cpp:4216] Shutting down executor 'test' 
of framework 9eb233b5-dfe8-4030-a989-57f31ec0e3aa-0000 at 
executor(1)@192.168.178.24:50933
I0323 17:24:51.817987 419418112 slave.cpp:3003] Handling status update 
TASK_KILLED (UUID: 9cf799dd-3492-42c2-9b43-fcb7283ebec9) for task test of 
framework 9eb233b5-dfe8-4030-a989-57f31ec0e3aa-0000 from 
executor(1)@192.168.178.24:50933
```

Excerpt from the executor's log we see, that `CommandExecutor::escalated()` has
been called in 3 seconds:
```
Shutting down
Sending SIGTERM to process tree at pid 23632
9276662428399220356
Sent SIGTERM to the following process trees:
[
--- 23632 /Users/alex/bin/unresponsive_process
]
10137513391590427691
12741826412569908689
Process 23632 did not terminate after 3secs, sending SIGKILL to process tree at 
23632
16972857197258033356
Killed the following process trees:
[
--- 23632 /Users/alex/bin/unresponsive_process
]
Command terminated with signal Killed: 9 (pid: 23632)
```

Test 1.2:
=========
`KillPolicy.grace_period` is **not set**, `shell=true`, framework shutdown.
The only difference to test 1.1 is `shell=true`. Similar behavior is observed.

Test 2.1:
=========
`KillPolicy.grace_period` set to **0s**, `shell=false`, framework shutdown.
Similar to test 1.1, but the escalation timeout is set to 0s (no deductions
because the kill policy is set explicitly; the grace period in containerizer
should be adjusted to approx. 2 seconds):
```
I0323 17:38:07.650616 189734912 slave.cpp:2080] Asked to shut down framework 
0dbc9e01-89ef-4cff-8c64-cdb3b46e9b2a-0000 by master@127.0.0.1:5050
I0323 17:38:07.650647 189734912 slave.cpp:2105] Shutting down framework 
0dbc9e01-89ef-4cff-8c64-cdb3b46e9b2a-0000
I0323 17:38:07.650686 189734912 slave.cpp:4216] Shutting down executor 'test' 
of framework 0dbc9e01-89ef-4cff-8c64-cdb3b46e9b2a-0000 at 
executor(1)@192.168.178.24:51836
I0323 17:38:07.823477 187052032 slave.cpp:3003] Handling status update 
TASK_KILLED (UUID: 56aad190-2ea2-4bb2-8d61-7a49ff42d4ba) for task test of 
framework 0dbc9e01-89ef-4cff-8c64-cdb3b46e9b2a-0000 from 
executor(1)@192.168.178.24:51836
```

Excerpt from the executor's log we see, that `CommandExecutor::escalated()` has
been called in 0 seconds:
```
Shutting down
Sending SIGTERM to process tree at pid 24075
Sent SIGTERM to the following process trees:
[
10370891801885978953
--- 24075 /Users/alex/bin/unresponsive_process
]
Process 24075 did not terminate after 0ns, sending SIGKILL to process tree at 
24075
Killed the following process trees:
[
--- 24075 /Users/alex/bin/unresponsive_process
]
Command terminated with signal Killed: 9 (pid: 24075)
```

Test 2.2:
=========
`KillPolicy.grace_period` set to **0s**, `shell=true`, framework shutdown
The only difference to test 2.1 is `shell=true`. Similar behavior is observed.

Test 3.1:
=========
`KillPolicy.grace_period` set to **11s**, `shell=false`, framework shutdown
Similar to test 1.1, but the escalation timeout is set to 11 seconds
(no deductions because the kill policy is set explicitly):
```
I0323 16:59:19.218186 189407232 slave.cpp:2105] Shutting down framework 
8bd2df7e-5659-4887-800e-604db3694e17-0000
I0323 16:59:19.218214 189407232 slave.cpp:4216] Shutting down executor 'test' 
of framework 8bd2df7e-5659-4887-800e-604db3694e17-0000 at 
executor(1)@192.168.178.24:65526
I0323 16:59:30.353431 190480384 slave.cpp:3003] Handling status update 
TASK_KILLED (UUID: e7a5fcfb-33a2-4341-8ca4-d372c754020c) for task test of 
framework 8bd2df7e-5659-4887-800e-604db3694e17-0000 from 
executor(1)@192.168.178.24:65526
```

Excerpt from the executor's log we see, that `CommandExecutor::escalated()` has
been called in 11 seconds:
```
Shutting down
Sending SIGTERM to process tree at pid 23173
Sent SIGTERM to the following process trees:
5447330384702360843
[
--- 23173 /Users/alex/bin/unresponsive_process
]
9421153335942970033
12365643494559189178
7584901102993551143
12120587866343901205
9383577596543058034
10156641407552912845
12050873774172063028
11004056749756330400
13931472820830639858
16620123046684546445
Process 23173 did not terminate after 11secs, sending SIGKILL to process tree 
at 23173
Killed the following process trees:
[
--- 23173 /Users/alex/bin/unresponsive_process
]
Command terminated with signal Killed: 9 (pid: 23173)
```

Test 3.2:
=========
`KillPolicy.grace_period` set to **11s**, `shell=true`, framework shutdown
The only difference to test 3.1 is `shell=true`. Similar behavior is observed.

Test 3.3:
=========
`KillPolicy.grace_period` set to **11s**, `shell=false`, `killTask`
Similar to test 3.1, but instead of framework shutdown, `killTask()` is called.
The difference is the executor's log, which indicates the `killTask` path is
being executed:
```
Received killTask
Shutting down
Sending SIGTERM to process tree at pid 25463
Sent SIGTERM to the following process trees:
[ 
--- 25463 /Users/alex/bin/unresponsive_process
]
17172602460659762152
4381544758593790168
10370891801885978953
4918013092589539175
9276662428399220356
10137513391590427691
12741826412569908689
16972857197258033356
3581100400388580249
4936931647359169578
14723711275895450681
Process 25463 did not terminate after 11secs, sending SIGKILL to process tree 
at 25463
Killed the following process trees:
[ 
--- 25463 /Users/alex/bin/unresponsive_process
]
Command terminated with signal Killed: 9 (pid: 25463)
Shutting down
```


Thanks,

Alexander Rukletsov

Reply via email to