Re: MESOS-695 / Automated self-healing and coordinated repair to Mesos

Jeff Currier Fri, 16 May 2014 10:06:48 -0700

+Charlie

Tom, Charlie is heading up this work now, so he can likely speak better to
what's taking place on this ticket than I can at the moment.


--Jeff--


On Fri, May 16, 2014 at 9:05 AM, Tom Arnfeld <[email protected]> wrote:

> Hi all,
>
> Wasn’t sure if it was right to start this thread on the JIRA issue. I
> just came across MESOS-695 (and what seems to be something almost
> finished!) about implementing some kind of self-healing mechanism in mesos,
> and also picked up on mentions of monit. From what I could tell based on
> the comments a while back, Twitter uses monit for health checking the
> slaves and monit will take over and restart the slave process if something
> funky is going on.
>
> I’m a big fan of monit, so this piques my interest...
>
> 1) I’d be interested in knowing what monit rules are defined for a
> “failing” or misbehaving slave, if this can be shared, or a correction on
> how monit is being used with mesos at Twitter.
> 2) This may already exist outside the community, but has there been
> discussion of writing a monit plugin to achieve this? This way you not only
> have a way of telling monit the slave needs a restart, but alerting comes
> free with it.
>
> I’m not too familiar with the implementation of this self-healing
> mechanism, but I assume one benefit of it being implemented in/around the
> master process is that it can gain a much wider view of what “misbehaving”
> means, in relation to all nodes in the cluster. The monitoring being
> outside-in rather than inside-out, a little similar to Hadoop’s
> blacklisting feature…?
>
> Thanks!
>
> Tom.
>
>
