[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834118#comment-16834118 ]
Greg Mann commented on MESOS-9767:
----------------------------------

[~ggarg] thanks for the info! Did the master respond to requests for the {{/master/health}} endpoint while it was stuck?

> Add self health monitoring in Mesos master
> ------------------------------------------
>
>                 Key: MESOS-9767
>                 URL: https://issues.apache.org/jira/browse/MESOS-9767
>             Project: Mesos
>          Issue Type: Task
>          Components: master
>    Affects Versions: 1.6.0
>            Reporter: Gaurav Garg
>            Priority: Major
>             Fix For: 1.7.2
>
>
> We have seen an issue where the Mesos master got stuck and stopped
> responding to HTTP endpoints like "/metrics/snapshot". This causes calls
> from the frameworks and the metrics collector to the master to hang.
> Currently we emit a 'master alive' metric using Prometheus. If the master
> hangs, this metric is not published, and we detect the hang using alerts
> on top of this metric. By the time someone has received the alert and
> restarted the master process, 15-30 minutes have passed. This results in
> SLA violations for Mesos cluster users.
> It would be nice to implement self health check monitoring to detect
> whether the Mesos master is hung/stuck. This would let us quickly crash
> the master process so that one of the other members of the quorum can
> acquire the ZK leadership lock.
> We can use the "/master/health" endpoint for health checks.
> Health checks can be initiated in
> [src/master/main.cpp|https://github.com/apache/mesos/blob/master/src/master/main.cpp]
> just after the child master process is
> [spawned|https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543].
> We can leverage the
> [HealthChecker|https://github.com/apache/mesos/blob/master/src/checks/health_checker.hpp]
> for this. One downside is that HealthChecker currently takes a TaskId as
> an input, which is not valid for a master health check.
> We can add the following flags to control the self health checking:
> # self_monitoring_enabled: Whether self monitoring is enabled.
> # self_monitoring_consecutive_failures: After this many consecutive
> health check failures, the master is crashed.
> # self_monitoring_interval_secs: Interval at which health checks are
> performed.
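
To make the interplay of the three proposed flags concrete, here is a minimal sketch of the consecutive-failure logic only. It is not Mesos code: `SelfMonitoringFlags`, `SelfHealthMonitor`, `tick()`, and the two callbacks are hypothetical names invented for illustration. The probe stands in for a request to "/master/health", and the unhealthy callback stands in for crashing the master.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical settings mirroring the three flags proposed in the ticket.
// The default values are placeholders, not values from the ticket.
struct SelfMonitoringFlags {
  bool enabled = false;                 // self_monitoring_enabled
  std::size_t consecutiveFailures = 3;  // self_monitoring_consecutive_failures
  unsigned intervalSecs = 10;           // self_monitoring_interval_secs
};

// Hypothetical monitor: counts consecutive failed probes and invokes
// `onUnhealthy` once the configured threshold is reached.
class SelfHealthMonitor {
public:
  SelfHealthMonitor(
      SelfMonitoringFlags flags,
      std::function<bool()> probe,        // assumed: true if the health check passed
      std::function<void()> onUnhealthy)  // assumed: would crash the master
    : flags_(flags),
      probe_(std::move(probe)),
      onUnhealthy_(std::move(onUnhealthy)) {}

  // Performs one health check. The caller is expected to invoke this every
  // `intervalSecs` seconds. Returns true if `onUnhealthy` was invoked.
  bool tick() {
    if (!flags_.enabled) {
      return false;
    }

    if (probe_()) {
      failures_ = 0;  // Any success resets the streak.
      return false;
    }

    if (++failures_ < flags_.consecutiveFailures) {
      return false;
    }

    onUnhealthy_();
    return true;
  }

  std::size_t failures() const { return failures_; }

private:
  SelfMonitoringFlags flags_;
  std::function<bool()> probe_;
  std::function<void()> onUnhealthy_;
  std::size_t failures_ = 0;
};
```

In a real implementation the probe would presumably issue an HTTP request with a timeout, and the caller of `tick()` would need to run on a thread or process that is not itself blocked by whatever is hanging the master; both points are assumptions beyond what the ticket specifies.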