[ https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589375#comment-16589375 ]
Benjamin Mahler commented on MESOS-9178:
----------------------------------------

Such a metric would be rather brittle: a single agent that cannot re-register after a master failover is enough to make it useless. I would love to see some alternatives explored here. For example, we could have some progress-oriented metrics:

* Time taken for the failed-over master to re-register 25%, 50%, 75%, 90%, 99%, and 100% of agents.

The metric described in this ticket would be the 100% case, but most users will probably monitor a lower percentage.

> Add a metric for master failover time.
> --------------------------------------
>
>                 Key: MESOS-9178
>                 URL: https://issues.apache.org/jira/browse/MESOS-9178
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>            Reporter: Xudong Ni
>            Assignee: Xudong Ni
>            Priority: Minor
>
> Quote from Yan Xu: Previously, the argument against it was that you don't know if
> all agents are going to come back after a master failover, so there's no
> certain point that marks the end of "full reregistration of all agents".
> However, empirically the number of agents usually doesn't change during the
> failover, and there's an upper bound on such a wait (after a 10-minute timeout, the
> agents that haven't reregistered are going to be marked unreachable), so we can
> just use that to stop the timer.
> So we can define failover time as "the time it takes for all agents recovered
> from the registry to be accounted for", i.e., either reregistered or marked as
> unreachable.
> This is of course looking at failover from an agent reregistration
> perspective.
> Later, after we add framework info persistence, we can similarly define the
> framework perspective using reregistration time or reconciliation time.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
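
For illustration, the progress-oriented metric suggested in the comment above could be sketched roughly as follows. This is a minimal sketch, not Mesos code: the function name `time_to_reregister`, its parameters, and the sample timings are all hypothetical. It assumes we know how many agents were recovered from the registry and, for each agent that came back, how many seconds after the failover it re-registered.

```python
def time_to_reregister(reregistered_at, recovered_agents,
                       percentages=(25, 50, 75, 90, 99, 100),
                       unreachable_timeout=None):
    """Seconds after failover at which each percentage of agents was back.

    reregistered_at:     seconds-after-failover for each agent that
                         re-registered (hypothetical input).
    recovered_agents:    number of agents recovered from the registry.
    unreachable_timeout: optional cap, in seconds, reported for any
                         percentage that was never reached (the ticket
                         proposes stopping the timer at that point).
    """
    if recovered_agents <= 0:
        raise ValueError("recovered_agents must be positive")

    times = sorted(reregistered_at)
    result = {}
    for pct in percentages:
        # Smallest agent count covering pct percent (integer ceiling).
        needed = (pct * recovered_agents + 99) // 100
        if 0 < needed <= len(times):
            result[pct] = times[needed - 1]
        else:
            # Not enough agents came back to reach this percentage.
            result[pct] = unreachable_timeout
    return result


# Four agents recovered; three re-register, one never comes back.
print(time_to_reregister([5.0, 2.0, 9.0], 4))
# {25: 2.0, 50: 5.0, 75: 9.0, 90: None, 99: None, 100: None}
```

With one agent missing, the 90%, 99%, and 100% values are never reached, which is the brittleness described in the comment. Passing `unreachable_timeout=600.0` instead reports the cap for those entries, mirroring the ticket's idea of counting an agent as accounted for once it is marked unreachable.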