[ https://issues.apache.org/jira/browse/MESOS-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041458#comment-14041458 ]
Benjamin Mahler commented on MESOS-1503:
----------------------------------------

Glad to see you think through this carefully. Let's hold off on this change until MESOS-1529 is resolved as the ping/pong semantics may change as a result.

> Improve slave health checking to prevent rapid widespread slave removals.
> -------------------------------------------------------------------------
>
>                 Key: MESOS-1503
>                 URL: https://issues.apache.org/jira/browse/MESOS-1503
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>            Reporter: Benjamin Mahler
>            Assignee: Timothy Chen
>              Labels: reliability
>
> Per some discussions with [~tweingartner] and [~vinodkone].
> Currently the master uses a SlaveObserver for each registered slave. Each
> SlaveObserver operates independently and makes decisions about whether the
> slave is healthy.
> The independence of these observers means that in some very rare events (e.g.
> masters are partitioned from 75% of slaves), the master can very rapidly
> remove a large portion of the slaves in the cluster. Ideally such an event
> could be deemed dangerous and throttled accordingly through a more
> intelligent notion of overall cluster health.
> It may be nice to have a single observer that is responsible for health
> checking all the slaves. This will allow us to make safer decisions as to
> when to determine that slaves are unhealthy.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
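
For illustration only, the single-observer idea from the issue description might be sketched roughly as below. This is not Mesos code: the class name ClusterHealthObserver, its methods, and the timeout, cap, and window values are all hypothetical, chosen only to show how one observer with a cluster-wide view could throttle removals instead of letting per-slave observers act independently.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterHealthObserver:
    """Hypothetical single observer for all slaves (not the Mesos API)."""

    ping_timeout: float = 75.0  # assumed: seconds without a pong before a slave counts as unhealthy
    max_removals: int = 1       # assumed: cap on removals per window
    window: float = 60.0        # assumed: window length in seconds
    last_pong: dict = field(default_factory=dict)        # slave id -> time of last pong
    recent_removals: list = field(default_factory=list)  # timestamps of recent removals

    def pong(self, slave_id, now):
        """Record that a slave answered a ping at time `now`."""
        self.last_pong[slave_id] = now

    def unhealthy(self, now):
        """Return the ids of slaves whose last pong is older than the timeout."""
        return sorted(s for s, t in self.last_pong.items()
                      if now - t > self.ping_timeout)

    def remove_unhealthy(self, now):
        """Remove unhealthy slaves, but never more than the per-window cap."""
        # Forget removals that have aged out of the window.
        self.recent_removals = [t for t in self.recent_removals
                                if now - t < self.window]
        budget = max(self.max_removals - len(self.recent_removals), 0)
        removed = self.unhealthy(now)[:budget]
        for slave_id in removed:
            del self.last_pong[slave_id]
            self.recent_removals.append(now)
        return removed
```

With a cap of one removal per window, a partition that makes most slaves look unhealthy at once would drain them one at a time rather than all together, leaving room for the partition to heal or for an operator to step in.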