Re: Review Request 65954: Add a gauge for how long agent recovery takes.

Zhitao Li Tue, 13 Mar 2018 17:05:49 -0700


> On March 9, 2018, 6:53 p.m., James Peach wrote:
> > src/slave/metrics.cpp
> > Lines 259 (patched)
> > <https://reviews.apache.org/r/65954/diff/2/?file=1972384#file1972384line259>
> >
> >     I don't know that I like the idea of a metric that is absent and then 
> > present. I'd prefer that we just published a `0.0` until recovery is 
> > complete.
> >     
> >     Suggest we keep the recovery timestamp in the `Slave` and just publish 
> > that.


I thought about that too, but I actually like the idea of the metric being 
absent when the value is not available yet. A zero value could confuse 
downstream aggregation.

For example, our team wants to compute the average recovery time across our 
cluster of thousands of agents, and zero values from agents that have not 
finished recovering would skew that calculation.
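
To illustrate the aggregation concern, here is a minimal sketch. The helper
names and sample values are hypothetical and not part of Mesos; an absent
metric is modeled as an empty `std::optional` (C++17 assumed).

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper (not Mesos code): averages recovery times across
// agents, skipping agents whose metric is absent because recovery has
// not finished. Returns nullopt if no agent has reported a value.
std::optional<double> averageRecoverSecs(
    const std::vector<std::optional<double>>& samples)
{
  double sum = 0.0;
  std::size_t count = 0;

  for (const std::optional<double>& sample : samples) {
    if (sample.has_value()) {
      sum += *sample;
      ++count;
    }
  }

  if (count == 0) {
    return std::nullopt;
  }

  return sum / static_cast<double>(count);
}

// Hypothetical helper: the same average, but with every absent metric
// published as 0.0, which pulls the result down.
double averageTreatingAbsentAsZero(
    const std::vector<std::optional<double>>& samples)
{
  assert(!samples.empty());

  double sum = 0.0;
  for (const std::optional<double>& sample : samples) {
    sum += sample.value_or(0.0);
  }

  return sum / static_cast<double>(samples.size());
}
```

With samples of 30 seconds, absent, and 60 seconds, the first helper reports an
average of 45 seconds, while the zero-publishing variant reports 30 seconds.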

I think Mesos already has some precedent for absent-then-present metrics. For 
instance, metrics under `allocator/mesos/roles/<role>/...` only show up once a 
framework under a new role registers after the master has started.

Let me know what you think.


> On March 9, 2018, 6:53 p.m., James Peach wrote:
> > src/slave/slave.cpp
> > Lines 7322 (patched)
> > <https://reviews.apache.org/r/65954/diff/2/?file=1972385#file1972385line7322>
> >
> >     Since the gauge is being published in seconds, you need to use 
> > `Duration::secs` to convert.

I prefer the API call to work on `Duration` and perform the `secs()` conversion 
as late as possible, as I've often seen people pass the wrong time unit when an 
API takes an integer/float.
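
A rough sketch of that design choice follows. It uses `std::chrono` as a
stand-in for stout's `Duration` so that it compiles on its own; the
`RecoveryMetrics` struct and `setRecoveryTime` method are hypothetical names,
not the code in this patch.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for a metrics holder; not the actual Mesos class.
struct RecoveryMetrics
{
  // Value that would be published by the gauge, in seconds.
  double recoverSecs = 0.0;

  // Accepts any std::chrono duration, so the caller cannot pass a bare
  // number in the wrong unit. The conversion to seconds happens here,
  // at the last possible moment.
  template <typename Rep, typename Period>
  void setRecoveryTime(const std::chrono::duration<Rep, Period>& elapsed)
  {
    recoverSecs = std::chrono::duration<double>(elapsed).count();
  }
};
```

A caller can then pass `std::chrono::milliseconds(1500)` or
`std::chrono::minutes(2)` and the stored value is 1.5 or 120 seconds
respectively, without the caller ever doing unit arithmetic.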


- Zhitao


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65954/#review198952
-----------------------------------------------------------


On March 7, 2018, 11:20 p.m., Zhitao Li wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65954/
> -----------------------------------------------------------
> 
> (Updated March 7, 2018, 11:20 p.m.)
> 
> 
> Review request for mesos, Gilbert Song, Greg Mann, Jason Lai, and James Peach.
> 
> 
> Bugs: MESOS-8609
>     https://issues.apache.org/jira/browse/MESOS-8609
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The new metric `slave/recover_secs` can be used to tell us how long
> the Mesos agent needed to finish its recovery cycle. This is an important
> metric on agent machines which have a lot of completed executor
> sandboxes.
> 
> Note that the metric 1) will only be available after recovery has succeeded
> and 2) will never change its value across the agent process lifecycle afterwards.
> 
> 
> Diffs
> -----
> 
>   src/slave/metrics.hpp 3fc933ca65690d6fad63156398ad9c2c53789296 
>   src/slave/metrics.cpp 0eb2b59ed67e14e73b29d7592c239441df0008d5 
>   src/slave/slave.cpp e2facb3c15a2f907f6497c58a36842ed707f2c70 
> 
> 
> Diff: https://reviews.apache.org/r/65954/diff/2/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>
