Gaurav Garg created MESOS-9767:
----------------------------------
Summary: Add self health monitoring in Mesos master
Key: MESOS-9767
URL: https://issues.apache.org/jira/browse/MESOS-9767
Project: Mesos
Issue Type: Task
Components: master
Affects Versions: 1.6.2
Reporter: Gaurav Garg
Fix For: 1.7.2
We have seen issues where the Mesos master got stuck and stopped responding to
HTTP endpoints such as "/metrics/snapshot". This causes calls from frameworks
and the metrics collector to the master to hang. Currently we emit a
'master alive' metric using Prometheus. If the master hangs, this metric is not
published, and we detect the hang using alerts on top of this metric. By the
time someone receives the alert and restarts the master process, 15-30 minutes
have passed. This results in SLA violations for Mesos cluster users.
It would be nice to implement self health monitoring to detect when the Mesos
master is hung or stuck. This would let us crash the master process quickly so
that one of the other members of the quorum can acquire the ZK leadership
lock.
We can use the "/master/health" endpoint for health checks.
Health checks can be initiated in
[src/master/main.cpp|https://github.com/apache/mesos/blob/master/src/master/main.cpp]
just after the child master process is
[spawned|https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543].
We can leverage the
[HealthChecker|https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp]
for this. One downside is that HealthChecker currently takes a TaskId as an
input, which does not apply to a master health check.
We can add the following flags to control self health checking:
# self_monitoring_enabled: Whether self monitoring is enabled.
# self_monitoring_consecutive_failures: The master is crashed after this many
consecutive health check failures.
# self_monitoring_interval_secs: Interval, in seconds, at which health checks
are performed.
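A minimal sketch of how the three proposed flags could interact, assuming a
plain counter of consecutive failures rather than the existing HealthChecker.
The names SelfMonitoringConfig and SelfMonitor and the default values are
hypothetical and only illustrate the idea. They are not part of Mesos.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical settings mirroring the three proposed flags. The defaults
// are placeholders chosen for illustration only.
struct SelfMonitoringConfig {
  bool enabled = false;               // self_monitoring_enabled
  int consecutiveFailures = 3;        // self_monitoring_consecutive_failures
  std::chrono::seconds interval{10};  // self_monitoring_interval_secs
};

// Tracks consecutive health check failures. Any success resets the streak.
class SelfMonitor {
public:
  explicit SelfMonitor(const SelfMonitoringConfig& config) : config_(config) {
    assert(config_.consecutiveFailures > 0);
  }

  // Records the result of one health check. Returns true once the number
  // of consecutive failures reaches the configured threshold, meaning the
  // master process should be crashed.
  bool record(bool healthy) {
    if (!config_.enabled) {
      return false;
    }
    failures_ = healthy ? 0 : failures_ + 1;
    return failures_ >= config_.consecutiveFailures;
  }

  // Rough time from the start of a hang to the crash decision, ignoring
  // any per-request timeout: interval * consecutive failures.
  std::chrono::seconds detectionTime() const {
    return config_.interval * config_.consecutiveFailures;
  }

private:
  SelfMonitoringConfig config_;
  int failures_ = 0;
};
```

Under these assumptions, the monitoring code would call record() once per
interval with the result of probing "/master/health" and abort the process
when it returns true. With a 10 second interval and 3 consecutive failures,
a hang would be acted on in roughly 30 seconds instead of 15-30 minutes.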