Will Rouesnel created MESOS-6358:
------------------------------------
Summary: Add watchdog timeout/action for mesos tasks which do not
exit
Key: MESOS-6358
URL: https://issues.apache.org/jira/browse/MESOS-6358
Project: Mesos
Issue Type: Improvement
Components: slave
Reporter: Will Rouesnel
Priority: Minor
When running with the docker containerizer, we've observed a scenario where a
subprocess of the docker container becomes a zombie due to a kernel bug (i.e. is
completely unkillable).
The effect was that Mesos kept reporting the task as running via its API, but
reported it as nonexistent in response to calls to delete it (made by
Marathon). The actual docker-runc/docker-container process never exited, since
none of the child processes exited while waiting on the misbehaving subprocess.
Mesos should include parameters to deal with this situation. I would propose
--task_kill_watchdog_timeout and --task_kill_watchdog_binary.
The idea would be that if a task still exists beyond the length of the timeout
*after* the hard kill signal has been sent, then Mesos executes the watchdog
binary.
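
To make the proposed behaviour concrete, here is a rough sketch of the logic in
Python. It is not Mesos code: the function name run_kill_watchdog, the
task_exists callback and the polling interval are all assumptions made for
illustration, and the two flags above are only a proposal, not existing Mesos
options.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def run_kill_watchdog(
    task_exists: Callable[[], bool],   # hypothetical probe: is the task still there?
    timeout_secs: float,               # stands in for --task_kill_watchdog_timeout
    watchdog_cmd: Sequence[str],       # stands in for --task_kill_watchdog_binary
    poll_interval_secs: float = 0.5,   # assumed polling interval, not part of the proposal
) -> Optional[int]:
    """Call this only *after* the hard kill signal has been sent.

    Returns None if the task went away within the timeout. Otherwise runs
    the watchdog command and returns its exit code.
    """
    deadline = time.monotonic() + timeout_secs
    while time.monotonic() < deadline:
        if not task_exists():
            return None  # the kill worked, nothing to do
        time.sleep(poll_interval_secs)

    # Final check so a task that exited during the last sleep is not reported.
    if not task_exists():
        return None

    # The task outlived the timeout: hand over to the watchdog action.
    return subprocess.run(list(watchdog_cmd), check=False).returncode
```

The watchdog action is deliberately left to an external command, so the
operator decides whether it only raises an alert or does something more
drastic.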
Such a mechanism would allow alerting on an exceptional situation (if
possible) and taking action to ensure worker nodes stay up on average (e.g. if
the condition is encountered, crash the node and let a BMC watchdog reboot it).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)