Re: MESOS-695 / Automated self-healing and coordinated repair to Mesos

Vinod Kone Mon, 19 May 2014 10:29:58 -0700

Monit usage for Mesos @Twitter is very basic. For both master and slave,
monit pings a known endpoint (/health or /stats.json) and restarts the
process if it fails to respond within a timeout (with retries).
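
For illustration only, the check described above amounts to roughly the following. This is a Python sketch, not the actual monit rules; the function names, the default timeout and retry count, and the `restart` callback are all hypothetical.

```python
import urllib.request
import urllib.error


def is_healthy(url, timeout=5.0, retries=3, opener=urllib.request.urlopen):
    """Return True if `url` answers HTTP 200 within `timeout` on any attempt."""
    for _ in range(retries):
        try:
            with opener(url, timeout=timeout) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection errors and timeouts count as one failed attempt.
            pass
    return False


def check_and_restart(url, restart, **kwargs):
    """Call `restart()` if the endpoint is unhealthy; return True if restarted."""
    if is_healthy(url, **kwargs):
        return False
    restart()  # hypothetical hook, e.g. whatever restarts the master or slave
    return True
```

The `opener` parameter exists only so the check can be exercised without a live master or slave.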


The motivation for self-healing is that it is coordinated via the master (as
you alluded to). It could also be more flexible and pluggable. For example, a
self-healing action for a non-responsive slave process could be to first
deactivate the slave (no more offers sent), then send a restart command, and,
if that doesn't work, re-image the host.
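
As a sketch of what such an escalation could look like (Python, purely illustrative; none of these names come from the MESOS-695 patch, and the callbacks are hypothetical stand-ins for whatever the master would actually invoke):

```python
def self_heal(slave, deactivate, repairs, is_responsive):
    """Escalate through repair actions until the slave responds again.

    `repairs` is an ordered list of (name, action) pairs, mildest first.
    Returns the name of the action that fixed the slave, or None if
    every action was tried without success.
    """
    deactivate(slave)  # stop sending offers before attempting any repair
    for name, action in repairs:
        action(slave)
        if is_responsive(slave):
            return name
    return None
```

With `repairs = [("restart", send_restart), ("reimage", reimage_host)]`, the host is only re-imaged when the restart did not bring the slave back. Keeping the steps in a list is one way to get the pluggability mentioned above.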


On Fri, May 16, 2014 at 9:05 AM, Tom Arnfeld <t...@duedil.com> wrote:

> Hi all,
>
> Wasn’t sure if it was right to start this thread on the JIRA issue.. I
> just came across MESOS-695 (and what seems to be something almost
> finished!) about implementing some kind of self-healing mechanism in mesos,
> and also picked up on mentions of monit. From what I could tell based on
> the comments a while back, Twitter uses monit for health checking the
> slaves and monit will take over and restart the slave process if something
> funky is going on.
>
> I’m a big fan of monit, so this piques my interest...
>
> 1) I’d be interested in knowing what monit rules are defined for a
> “failing” or misbehaving slave, if this can be shared, or a correction on
> how monit is being used with mesos at Twitter.
> 2) This may already exist outside the community, but has there been
> discussion of writing a monit plugin to achieve this? This way you not only
> have a way of telling monit the slave needs a restart, but alerting comes
> free with it.
>
> I’m not too familiar with the implementation of this self-healing
> mechanism, but I assume one benefit of it being implemented in/around the
> master process is that it can gain a much wider view of what “misbehaving”
> means, in relation to all nodes in the cluster. The monitoring being
> outside-in rather than inside-out, a little similar to Hadoop’s
> blacklisting feature…?
>
> Thanks!
>
> Tom.
>
>
