+Charlie

Tom,

Charlie is heading up this work now, so he can likely better speak to what's taking place on this ticket than I can at the moment.

--Jeff--

On Fri, May 16, 2014 at 9:05 AM, Tom Arnfeld <[email protected]> wrote:

> Hi all,
>
> Wasn’t sure if it was right to start this thread on the JIRA issue.. I
> just came across MESOS-695 (and what seems to be something almost
> finished!) about implementing some kind of self-healing mechanism in mesos,
> and also picked up on mentions of monit. From what I could tell based on
> the comments a while back, Twitter uses monit for health checking the
> slaves and monit will take over and restart the slave process if something
> funky is going on.
>
> I’m a big fan of monit, so this piques my interest...
>
> 1) I’d be interested in knowing what monit rules are defined for a
> “failing” or misbehaving slave, if this can be shared, or a correction on
> how monit is being used with mesos at Twitter.
> 2) This may already exist outside the community, but has there been
> discussion of writing a monit plugin to achieve this? This way you not only
> have a way of telling monit the slave needs a restart, but alerting comes
> free with it.
>
> I’m not too familiar with the implementation of this self-healing
> mechanism, but I assume one benefit of it being implemented in/around the
> master process is that it can gain a much wider view of what “misbehaving”
> means, in relation to all nodes in the cluster. The monitoring being
> outside-in rather than inside-out, a little similar to Hadoop’s
> blacklisting feature…?
>
> Thanks!
>
> Tom.
