[ https://issues.apache.org/jira/browse/AURORA-1941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reza Motamedi closed AURORA-1941.
---------------------------------
    Resolution: Not A Problem

This does not seem to be a problem. It has been this way by design.

> Cause container restart when a process is killed with a signal.
> ---------------------------------------------------------------
>
>                 Key: AURORA-1941
>                 URL: https://issues.apache.org/jira/browse/AURORA-1941
>             Project: Aurora
>          Issue Type: Task
>            Reporter: Reza Motamedi
>            Priority: Minor
>
> Say you have the following task config. Note that all processes have
> max_failures = 1.
> {code}
> {
>   "processes": [
>     {
>       "daemon": false,
>       "name": "hello-0",
>       "max_failures": 1,
>       "ephemeral": false,
>       "min_duration": 5,
>       "cmdline": "while true; do echo `date`; sleep 60; done",
>       "final": false
>     },
>     {
>       "daemon": false,
>       "name": "hello-1",
>       "max_failures": 1,
>       "ephemeral": false,
>       "min_duration": 5,
>       "cmdline": "while true; do echo `date`; sleep 60; done",
>       "final": false
>     },
>     {
>       "daemon": false,
>       "name": "hello-2",
>       "max_failures": 1,
>       "ephemeral": false,
>       "min_duration": 5,
>       "cmdline": "while true; do echo `date`; sleep 60; done",
>       "final": false
>     }
>   ],
>   "name": "hello-0",
>   "finalization_wait": 30,
>   "max_failures": 1,
>   "max_concurrency": 0,
>   "resources": {
>     "gpu": 0,
>     "disk": 16777216,
>     "ram": 1048576,
>     "cpu": 0.1
>   },
>   "constraints": []
> }
> {code}
> Say we kill one of these Thermos processes. In this case, the process gets
> restarted, since it technically did not crash or fail. Even if you kill it
> with `kill -SIGSEGV <pid>`, it still comes back up and the number of
> failures stays at 0. Instead, the kill is registered as the process being
> lost, and that number correctly increases.
> I think it makes sense to check the exit code on a process kill and count it
> as a failure if the exit code is not `0`.
> Note that if one of the processes fails or crashes, it is handled
> differently:
> - on_killed
> {noformat}
> D0706 18:38:32.944282 12808 runner.py:156] Process on_killed ProcessStatus(seq=3, process='hello-2', start_time=None, coordinator_pid=None, pid=None, return_code=-9, state=4, stop_time=1499366312.421471, fork_time=None)
> {noformat}
> - on_failed
> {noformat}
> D0706 22:37:14.829272 23216 runner.py:138] Process on_failed ProcessStatus(seq=3, process='hello-bad', start_time=None, coordinator_pid=None, pid=None, return_code=139, state=5, stop_time=1499380634.768661, fork_time=None)
> {noformat}
> We can just check the `ProcessStatus.return_code` and act accordingly.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
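
For illustration only, the check proposed in the issue (treat any non-zero `return_code` as a failure) might look roughly like the sketch below. It is not Thermos code: the `ProcessStatus` namedtuple is a hypothetical stand-in for the real struct, and `signal_number` and `counts_as_failure` are made-up helper names. It assumes that a negative return code means "terminated by that signal" (the convention Python's `subprocess` uses, which would match the `-9` in the on_killed log) and that a value above 128 follows the shell convention of 128 plus the signal number (139 = 128 + 11, SIGSEGV).

```python
from collections import namedtuple

# Hypothetical stand-in carrying only the ProcessStatus fields used here.
ProcessStatus = namedtuple("ProcessStatus", ["process", "return_code"])


def signal_number(return_code):
    """Best-effort guess at the terminating signal, or None if there is none.

    Assumption: negative codes are -signum (subprocess convention) and
    codes above 128 are 128 + signum (shell convention).
    """
    if return_code is None:
        return None
    if return_code < 0:
        return -return_code
    if return_code > 128:
        return return_code - 128
    return None


def counts_as_failure(status):
    """The rule proposed in the issue: any known non-zero code is a failure."""
    return status.return_code is not None and status.return_code != 0


killed = ProcessStatus(process="hello-2", return_code=-9)
failed = ProcessStatus(process="hello-bad", return_code=139)
for status in (killed, failed):
    print(status.process, counts_as_failure(status), signal_number(status.return_code))
```

Under this rule both statuses from the logs above would count against `max_failures`, which is the behaviour the issue asked for and which was ultimately declined as contrary to the existing design.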