[jira] [Commented] (MESOS-9178) Add a metric for master failover time.

Benjamin Mahler (JIRA) Wed, 22 Aug 2018 14:42:02 -0700


    [ 
https://issues.apache.org/jira/browse/MESOS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589375#comment-16589375
 ]


Benjamin Mahler commented on MESOS-9178:
----------------------------------------

Such a metric would be rather brittle: it takes only one agent failing to 
re-register after a master failover to make it useless. I would love to see 
some alternatives explored here, e.g.

We could have some progress-oriented metrics:
* Time taken for the failed-over master to re-register a given percentage 
(25%, 50%, 75%, 90%, 99%, 100%) of agents. The metric described in this ticket 
would be the 100% case, but most users will probably monitor a lower 
percentage.
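
To make the idea concrete, here is a minimal Python sketch of how such 
per-percentage timings could be derived. It is not Mesos code: the function 
name, its parameters, and the assumption that re-registration times are 
available as seconds since the new master was elected are all hypothetical.

```python
# Hypothetical sketch, not Mesos code: all names here are illustrative.
PERCENTAGES = (25, 50, 75, 90, 99, 100)


def time_to_thresholds(reregistered_at, recovered_total, percentages=PERCENTAGES):
    """Map each percentage to the time (seconds since the new master was
    elected, by assumption) at which that share of the recovered agents had
    re-registered, or None if that share has not been reached."""
    times = sorted(reregistered_at)
    result = {}
    for percent in percentages:
        # Integer ceiling of percent% of the recovered agents.
        needed = (percent * recovered_total + 99) // 100
        if needed == 0:
            result[percent] = 0.0  # No agents to wait for.
        elif needed <= len(times):
            result[percent] = times[needed - 1]
        else:
            result[percent] = None  # Threshold not reached (yet).
    return result


# 4 agents recovered from the registry, only 3 of them came back.
timings = time_to_thresholds([5.0, 2.0, 9.0], recovered_total=4)
print(timings)
```

With one of four agents missing, the 25%, 50% and 75% entries are populated 
while the 90%, 99% and 100% entries stay unset, which is the brittleness of the 
100% case described above.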

> Add a metric for master failover time.
> --------------------------------------
>
>                 Key: MESOS-9178
>                 URL: https://issues.apache.org/jira/browse/MESOS-9178
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>            Reporter: Xudong Ni
>            Assignee: Xudong Ni
>            Priority: Minor
>
> Quote from Yan Xu: Previously, the argument against it was that you don't 
> know if all agents are going to come back after a master failover, so there's 
> no certain point that marks the end of "full reregistration of all agents". 
> However, empirically the number of agents usually doesn't change during the 
> failover, and there's an upper bound on such a wait (after a 10min timeout the 
> agents that haven't reregistered are going to be marked unreachable), so we 
> can just use that to stop the timer.
> So we can define failover time as "the time it takes for all agents recovered 
> from the registry to be accounted for", i.e., either reregistered or marked as 
> unreachable.
> This is of course looking at failover from an agent reregistration 
> perspective.
> Later, after we add framework info persistence, we can similarly define the 
> framework perspective using reregistration time or reconciliation time.
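
For reference, the definition quoted in the ticket can be sketched in a few 
lines of Python. This is only an illustration under stated assumptions, not 
Mesos code: the names are hypothetical, times are assumed to be seconds since 
the new master was elected, and the 600-second constant simply mirrors the 
"10min timeout" mentioned in the description.

```python
# Hypothetical sketch, not Mesos code: all names here are illustrative.
# Mirrors the "10min timeout" from the ticket description.
UNREACHABLE_TIMEOUT_SECS = 600.0


def failover_time(recovered_agents, reregistered_at,
                  timeout=UNREACHABLE_TIMEOUT_SECS):
    """Time until every agent recovered from the registry is accounted for,
    i.e. has either re-registered or been marked unreachable at `timeout`."""
    if not recovered_agents:
        return 0.0
    return max(
        # An agent with no re-registration time is accounted for at the timeout.
        min(reregistered_at.get(agent, timeout), timeout)
        for agent in recovered_agents
    )


agents = ["a", "b", "c"]
print(failover_time(agents, {"a": 3.0, "b": 12.5, "c": 40.0}))  # 40.0
print(failover_time(agents, {"a": 3.0, "b": 12.5}))             # 600.0
```

The second call shows the concern raised in the comment: a single agent that 
never comes back pushes the value to the timeout.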



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
