> On Oct. 18, 2018, 6:14 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 1874-1875 (patched)
> > <https://reviews.apache.org/r/68706/diff/5/?file=2098490#file2098490line1874>
> >
> >     If we use equality operator and only set the timer when such a number 
> > of reregistered agent is reached we are guaranteed to only set each timer 
> > once (but we may need to set multiple timers in one call) right? This 
> > alleviates the need to check if the timer is already set!
> >     
> >     This should also work with boundary cases like the total recovered 
> > agents count being 0 or 1 (overlapping percentiles) etc. Right?
> 
> Jiang Yan Xu wrote:
>     We should comment on the reasons for dropping issues.

This is actaully fixed and marked as dropped somehow. We used equality operator 
to 0.0 to check whether the percentile was reached or not; The reason we used 
push gauge not timer is explained in push gauge vs timer comments section


> On Oct. 18, 2018, 6:14 p.m., Jiang Yan Xu wrote:
> > src/master/metrics.hpp
> > Lines 51 (patched)
> > <https://reviews.apache.org/r/68706/diff/5/?file=2098491#file2098491line51>
> >
> >     On using Timer (e.g., like 
> > [state_fetch](https://github.com/apache/mesos/blob/7f36ebc1775398a43b2aa3a81bb647fb296b8313/src/master/registrar.cpp#L172))
> >  vs. PushGauge, after looking at how it'll be used I think the main 
> > advantage of Timer is that it doesn't export any value if you haven't set 
> > it.
> >     
> >     Consider the two cases:
> >     
> >     1. There are 1000 recovered agents and 0 have reregsitered, should all 
> > the timers have zero values or should they be absent?
> >     
> >     2. There are 0 recovered agents (e.g., brand new cluster), should all 
> > of the metrics be zero or non-existent? I feel like they should be zero, as 
> > in, e.g., 100% of all 0 agents are reregistered within 0 secs.
> >     
> >     So timer handles this natrually. Also it sets the `_secs` name for you 
> > but that's a minor conveninence.
> 
> Jiang Yan Xu wrote:
>     We should comment on the reasons for dropping issues.

Sorry about this, I did make the comments but it must be in one of draft which 
was not saved. 

I agree that metrics should be zero but not absent when certain percentige were 
not reached. I did tried both PushGauge and Timer implementation and tested in 
our clusters.

If we used the timer, when the value was not set, that metric is actually 
missing. PushGauge is set with the initial value 0.0 and we can tell whether 
it's set yet, the metric will always exist no matter that percentile reached or 
not, and it has better performance.

The brand new cluster case was tested and the results were included in the test 
results.


- Xudong


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209722
-----------------------------------------------------------


On Oct. 19, 2018, 11:56 p.m., Xudong Ni wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------------------------------------------------------
> 
> (Updated Oct. 19, 2018, 11:56 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
>     https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -----
> 
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/7/
> 
> 
> Testing
> -------
> 
> Automation:
> [ RUN      ] MasterTest.MetricsInMetricsEndpoint
> [       OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> 
> While the master is not elected or there is no agents recovered yet
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 0.0,
> "master/recovered_agents_50_percent_reregistered_secs": 0.0,
> "master/recovered_agents_75_percent_reregistered_secs": 0.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 0.0,
> 
> While reregistrations is in progress: 5 out of 6 completed:
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 5.0,
> 
> 
> While 6 reregistrations were all completed:
> "master/recovered_agents_100_percent_reregistered_secs": 22.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovered_agents_90_percent_reregistered_secs": 22.0,
> "master/recovered_agents_99_percent_reregistered_secs": 22.0,
> "master/slave_reregistrations": 6.0,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>

Reply via email to