MESOS-695 / Automated self-healing and coordinated repair to Mesos

Tom Arnfeld Fri, 16 May 2014 10:21:36 -0700

Hi all,

I wasn’t sure whether it was right to start this thread on the JIRA issue. I just
came across MESOS-695 (which seems to be almost finished!) about
implementing some kind of self-healing mechanism in Mesos, and also picked up
on the mentions of monit. From what I could tell from the comments a while
back, Twitter uses monit to health-check the slaves, and monit will take
over and restart the slave process if something funky is going on.


I’m a big fan of monit, so this piques my interest...

1) I’d be interested in knowing what monit rules are defined for a “failing” or
misbehaving slave, if this can be shared, or in a correction on how monit is being
used with Mesos at Twitter.
2) This may already exist outside the community, but has there been discussion
of writing a monit plugin to achieve this? That way you not only have a way of
telling monit the slave needs a restart, but you also get alerting for free.

I’m not too familiar with the implementation of this self-healing mechanism,
but I assume one benefit of it being implemented in/around the master process
is that it can gain a much wider view of what “misbehaving” means, in relation
to all nodes in the cluster. The monitoring would then be outside-in rather than
inside-out, a little like Hadoop’s blacklisting feature…?

Thanks!

Tom.
