[ 
https://issues.apache.org/jira/browse/AURORA-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16077370#comment-16077370
 ] 

Reza Motamedi commented on AURORA-1941:
---------------------------------------

I did not think about this in advance. I guess there is no way to kill a 
process and have it exit with exit code 0.
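
For reference, a minimal sketch of this on a POSIX system, using only the Python standard library. The helper name `return_code_after_signal` is made up for illustration and is not part of Thermos. It relies on the documented behaviour of `subprocess.Popen`, which reports a child terminated by signal N as return code -N.

```python
import signal
import subprocess
import sys


def return_code_after_signal(sig):
    """Start a sleeping child, send it `sig`, and return its return code.

    Hypothetical helper for illustration only; POSIX semantics assumed.
    """
    # The child only sleeps, so the signal's default action applies to it.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(sig)
    # On POSIX, Popen reports "terminated by signal N" as return code -N.
    return child.wait()
```

Calling `return_code_after_signal(signal.SIGKILL)` yields a non-zero value, which matches the `return_code=-9` in the on_killed log quoted below.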

> Cause container restart when a process is killed with a signal.
> ---------------------------------------------------------------
>
>                 Key: AURORA-1941
>                 URL: https://issues.apache.org/jira/browse/AURORA-1941
>             Project: Aurora
>          Issue Type: Task
>            Reporter: Reza Motamedi
>            Priority: Minor
>
> Say you have the following task config. Note that all processes have 
> max_failures = 1.
> {code}
> {
>     "processes": [
>         {
>             "daemon": false, 
>             "name": "hello-0", 
>             "max_failures": 1, 
>             "ephemeral": false, 
>             "min_duration": 5, 
>             "cmdline": "while true; do echo `date`; sleep 60; done", 
>             "final": false
>         }, 
>         {
>             "daemon": false, 
>             "name": "hello-1", 
>             "max_failures": 1, 
>             "ephemeral": false, 
>             "min_duration": 5, 
>             "cmdline": "while true; do echo `date`; sleep 60; done", 
>             "final": false
>         }, 
>         {
>             "daemon": false, 
>             "name": "hello-2", 
>             "max_failures": 1, 
>             "ephemeral": false, 
>             "min_duration": 5, 
>             "cmdline": "while true; do echo `date`; sleep 60; done", 
>             "final": false
>         }
>     ], 
>     "name": "hello-0", 
>     "finalization_wait": 30, 
>     "max_failures": 1, 
>     "max_concurrency": 0, 
>     "resources": {
>         "gpu": 0, 
>         "disk": 16777216, 
>         "ram": 1048576, 
>         "cpu": 0.1
>     }, 
>     "constraints": []
> }
> {code}
> Say we kill one of these thermos processes. In this case, the process gets 
> restarted since it technically did not crash/fail. Even if you kill it with 
> `kill -SIGSEGV <pid>` it still comes back up again and the number of failures 
> stays at 0. Instead, the kill is registered as the process being lost, and 
> that counter correctly increases.
> I think it makes sense to check the exit code on a process kill and count it 
> as a failure if the exit code is not `0`.
> Note that if one of the processes fails / crashes, it is handled differently:
> - on_killed
> {noformat}
> D0706 18:38:32.944282 12808 runner.py:156] Process on_killed 
> ProcessStatus(seq=3, process='hello-2', start_time=None, 
> coordinator_pid=None, pid=None, return_code=-9, state=4, 
> stop_time=1499366312.421471, fork_time=None)
> {noformat}
> - on_failed
> {noformat}
> D0706 22:37:14.829272 23216 runner.py:138] Process on_failed 
> ProcessStatus(seq=3, process='hello-bad', start_time=None, 
> coordinator_pid=None, pid=None, return_code=139, state=5, 
> stop_time=1499380634.768661, fork_time=None)
> {noformat}
> We can just check the `ProcessStatus.return_code` and act accordingly.
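
A rough sketch of what such a check could look like, based only on the two return codes in the logs above. The function `describe_return_code` is hypothetical and is not the Thermos runner's API. It assumes that a negative value means the process was killed by that signal (as Python's `subprocess` reports it) and that a value above 128 follows the common shell convention of 128 plus the signal number, so 139 would correspond to signal 11 (SIGSEGV).

```python
def describe_return_code(return_code):
    """Return a (counts_as_failure, reason) pair for a finished process.

    Hypothetical helper for illustration; not part of Thermos.
    """
    if return_code == 0:
        return False, "clean exit"
    if return_code < 0:
        # subprocess-style encoding: -N means "terminated by signal N".
        return True, "killed by signal %d" % -return_code
    if return_code > 128:
        # Shell convention (an assumption here): 128 + N for signal N.
        return True, "exit status %d (possibly signal %d)" % (
            return_code, return_code - 128)
    return True, "exit status %d" % return_code
```

Under this sketch both `-9` (from the on_killed log) and `139` (from the on_failed log) would count as failures, and only `0` would not.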



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
