Will Rouesnel created MESOS-6358:
------------------------------------

             Summary: Add watchdog timeout/action for mesos tasks which do not exit
                 Key: MESOS-6358
                 URL: https://issues.apache.org/jira/browse/MESOS-6358
             Project: Mesos
          Issue Type: Improvement
          Components: slave
            Reporter: Will Rouesnel
            Priority: Minor


When running with the Docker containerizer, we've observed a scenario where a 
subprocess of the Docker container becomes a zombie due to a kernel bug (i.e. it 
is completely unkillable).

The effect was that Mesos kept reporting the task as running via its API, but 
reported it as nonexistent when Marathon made calls to delete it. This happened 
because the actual docker-runc/docker-container process never exited (none of 
the child processes exited, as they were waiting on the misbehaving subprocess).

Mesos should include parameters to deal with this situation. I would propose 
--task_kill_watchdog_timeout and --task_kill_watchdog_binary.

The idea would be that if a task still exists once the timeout has elapsed 
*after* the hard kill signal has been sent, then Mesos executes the watchdog 
binary.
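
The proposed behaviour could be sketched roughly as follows. This is a minimal 
illustration in Python, not Mesos code: the two flags are only a proposal, and 
the names pid_exists and wait_then_watchdog, as well as the convention of 
passing the stuck PID as the last argument to the watchdog command, are 
assumptions made for the example.

```python
import os
import subprocess
import time


def pid_exists(pid):
    """Return True if a process with this PID still exists (zombies count)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_then_watchdog(pid, timeout, watchdog_cmd, exists=pid_exists, poll=0.05):
    """Call this right after the hard kill signal has been sent to `pid`.

    Returns None if the process disappeared within `timeout` seconds,
    otherwise runs `watchdog_cmd` and returns its exit code.
    """
    deadline = time.monotonic() + timeout
    while exists(pid):
        if time.monotonic() >= deadline:
            # Hypothetical contract: the stuck PID is passed as the last argument.
            result = subprocess.run(list(watchdog_cmd) + [str(pid)], check=False)
            return result.returncode
        time.sleep(poll)
    return None
```

The timer starts only after the hard kill, so a task that exits normally never 
triggers the watchdog command.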

Such a mechanism would allow alerting on an exceptional situation (if possible) 
and taking action to ensure worker nodes can stay up on average (i.e. if the 
situation is encountered, crash the node and let a BMC watchdog reboot it).
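
As an illustration of what such a watchdog action might do, here is a small 
Python sketch. It is an assumption-laden example, not part of the proposal: 
the function name watchdog_action and its parameters are hypothetical, the 
alert is just a line on stderr, and the crash step relies on the Linux SysRq 
trigger file (writing "c" to it forces a kernel crash and requires root). It 
defaults to a dry run so that nothing is crashed by accident.

```python
import sys

# Linux-only: writing "c" to this file forces an immediate kernel crash.
SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def watchdog_action(task_id, trigger_path=SYSRQ_TRIGGER, dry_run=True):
    """Alert about a task that survived the hard kill, then crash the node.

    With dry_run=True (the default) only the alert is emitted.
    Returns the alert message.
    """
    message = f"task {task_id} survived hard kill; watchdog action triggered"
    print(message, file=sys.stderr)  # stand-in for a real alerting channel
    if not dry_run:
        # After the crash, a BMC watchdog is expected to reboot the node.
        with open(trigger_path, "w") as trigger:
            trigger.write("c")
    return message
```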



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
