Monit usage for Mesos @Twitter is very basic. For both master and slave, monit pings a known endpoint (/health or /stats.json) and restarts the process if it fails to respond within a timeout (with retries).
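To make that concrete, here is a rough Python sketch of the check-then-restart logic described above. It is not our actual monit configuration. The function names, the retry count, the timeout, and the `restart` callback are all made up for illustration.

```python
import http.client
import urllib.error
import urllib.request


def is_healthy(url, timeout=5.0):
    """Return True if `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, non-2xx status, or a malformed reply.
        return False


def check_and_restart(url, restart, retries=3, timeout=5.0, probe=is_healthy):
    """Probe `url` up to `retries` times; call `restart()` only if all fail.

    Returns True if a probe succeeded, False if the restart was triggered.
    `restart` is a hypothetical callback (e.g. one that shells out to the
    init system); `probe` is injectable so the logic can be tested offline.
    """
    for _ in range(retries):
        if probe(url, timeout):
            return True
    restart()
    return False
```

As a usage example, something like `check_and_restart("http://localhost:5051/health", restart=my_restart_fn)` could be run periodically against a slave. Here 5051 is assumed to be the slave's port and `my_restart_fn` is a placeholder. A real monit rule would express the same thing declaratively, and would also wait one polling cycle between probes rather than retrying back to back.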
The motivation for self-healing is that it is co-ordinated via the master (as you alluded to). It could also be more flexible and pluggable. For example, a self-healing action for a non-responsive slave process could be to first deactivate the slave (no more offers sent), then send a restart command, and if that doesn't work, re-image the host.

On Fri, May 16, 2014 at 9:05 AM, Tom Arnfeld <t...@duedil.com> wrote:

> Hi all,
>
> Wasn’t sure if it was right to start this thread on the JIRA issue.. I
> just came across MESOS-695 (and what seems to be something almost
> finished!) about implementing some kind of self-healing mechanism in mesos,
> and also picked up on mentions of monit. From what I could tell based on
> the comments a while back, Twitter uses monit for health checking the
> slaves and monit will take over and restart the slave process if something
> funky is going on.
>
> I’m a big fan of monit, so this peaks my interest...
>
> 1) I’d be interested in knowing what monit rules are defined for a
> “failing” or misbehaving slave, if this can be shared, or a correction on
> how monit is being used with mesos at Twitter.
> 2) This may already exist outside the community, but has there been
> discussion of writing a monit plugin to achieve this? This way you not only
> have a way of telling monit the slave needs a restart, but alerting comes
> free with it.
>
> I’m not too familiar with the implementation of this self-healing
> mechanism, but I assume one benefit of it being implemented in/around the
> master process is that it can gain a much wider view of what “misbehaving”
> means, in relation to all nodes in the cluster. The monitoring being
> outside-in rather than inside-out, a little similar to Hadoop’s
> blacklisting feature…?
>
> Thanks!
>
> Tom.