Gaurav Garg created MESOS-9767:
----------------------------------

             Summary: Add self health monitoring in Mesos master
                 Key: MESOS-9767
                 URL: https://issues.apache.org/jira/browse/MESOS-9767
             Project: Mesos
          Issue Type: Task
          Components: master
    Affects Versions: 1.6.2
            Reporter: Gaurav Garg
             Fix For: 1.7.2


We have seen an issue where the Mesos master got stuck and stopped responding 
to HTTP endpoints such as "/metrics/snapshot". As a result, calls from 
frameworks and the metrics collector to the master hang. We currently emit a 
'master alive' metric using Prometheus. If the master hangs, this metric is not 
published, and we detect the hang through alerts on top of this metric. By the 
time someone receives the alert and restarts the master process, 15-30 minutes 
have passed. This results in SLA violations for Mesos cluster users.

It would be nice to implement self health check monitoring to detect whether 
the Mesos master is hung or stuck. This would let us crash the master process 
quickly so that one of the other members of the quorum can acquire the ZK 
leadership lock.



We can use the "/master/health" endpoint for health checks. Health checks can 
be initiated in 
[src/master/main.cpp|https://github.com/apache/mesos/blob/master/src/master/main.cpp] 
just after the child master process is 
[spawned|https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543].
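
As a rough illustration of such a probe, here is a minimal sketch in Python 
(not Mesos code, which is C++). The function name master_is_healthy and its 
timeout parameter are hypothetical. The sketch assumes that a healthy master 
answers "/master/health" with HTTP 200, and it treats a timeout, a refused 
connection, or any other status as unhealthy.

```python
import http.client
import urllib.request


def master_is_healthy(url: str, timeout_secs: float) -> bool:
    """Return True only if `url` answers with HTTP 200 within the timeout.

    Hypothetical helper for illustration; not part of Mesos.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_secs) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, socket timeouts (the hung-master case),
        # and non-2xx replies, which urllib raises as HTTPError.
        return False
```

With a master on the default port this could be called as 
master_is_healthy("http://localhost:5050/master/health", 5.0). The timeout 
matters most here, since a hung master may accept the connection and never 
reply.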

We can leverage the 
[HealthChecker|https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp] 
for this. One downside is that HealthChecker currently takes a TaskID as 
input, which does not apply to a master health check. 



We can add the following flags to control the self health checking:
 # self_monitoring_enabled: Whether self monitoring is enabled.
 # self_monitoring_consecutive_failures: After this many consecutive health 
check failures, the master is crashed.
 # self_monitoring_interval_secs: Interval at which health checks are performed.
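
To make the intended interaction of these flags concrete, here is a minimal 
sketch in Python (not Mesos code). The SelfMonitor class, its tick method, and 
the probe and on_unhealthy callbacks are hypothetical names. It assumes that 
one successful check resets the failure count, and that the crash action fires 
once the count reaches self_monitoring_consecutive_failures.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SelfMonitor:
    """Hypothetical watchdog: counts consecutive failed health checks."""

    probe: Callable[[], bool]          # returns True when the master is healthy
    consecutive_failures: int          # self_monitoring_consecutive_failures
    on_unhealthy: Callable[[], None]   # crash action, e.g. os.abort
    failures: int = 0

    def tick(self) -> None:
        """Run one health check; call once per self_monitoring_interval_secs."""
        if self.probe():
            # Any success resets the streak, so only an unbroken run of
            # failures can reach the threshold.
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.consecutive_failures:
            self.on_unhealthy()
```

A driver would call tick() in a loop, sleeping self_monitoring_interval_secs 
between calls, and would skip the loop entirely when self_monitoring_enabled is 
false. Detection time is then roughly the interval multiplied by the failure 
threshold, plus any probe timeout.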



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
