Hi all,

I wasn’t sure if it was right to start this thread on the JIRA issue, so I’m asking here instead. I just came across MESOS-695 (which seems to be almost finished!) about implementing some kind of self-healing mechanism in Mesos, and also picked up on mentions of monit. From what I could tell from the comments a while back, Twitter uses monit to health-check the slaves, and monit takes over and restarts the slave process if something funky is going on.
I’m a big fan of monit, so this piques my interest.

1) I’d be interested to know which monit rules are defined for a “failing” or misbehaving slave, if this can be shared, or to be corrected on how monit is used with Mesos at Twitter.

2) This may already exist outside the community, but has there been any discussion of writing a monit plugin to achieve this? That way you not only have a way of telling monit the slave needs a restart, but alerting comes for free with it.

I’m not too familiar with the implementation of this self-healing mechanism, but I assume one benefit of implementing it in or around the master process is that it gains a much wider view of what “misbehaving” means in relation to all nodes in the cluster. The monitoring would be outside-in rather than inside-out, somewhat like Hadoop’s blacklisting feature?

Thanks!
Tom.