> On May 19, 2017, at 11:35 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
> 
> Hi,
> 
> I'd like to start a conversation to talk about metrics collection endpoints
> (especially `/metrics/snapshot`) behavior.
> 
> Right now, these endpoints are served from the same master/agent's
> libprocess, and extensively use `gauge` to chain further callbacks to
> collect various metrics (DRF allocator specifically adds several metrics
> per role).
> 
> This brings a problem when the system is under load: when the
> master/allocator libprocess becomes busy, stats collection itself becomes
> slow too. Flying dark when the system is under load is especially painful
> for an operator.

Yes, sampling metrics should approach zero cost.

> I would like to explore the direction of isolating metric collection even
> when the master is slow. A couple of ideas:
> 
> - (short term) reduce usage of gauge and prefer counter (since I believe
> they are less affected);

I'd rather not squash the semantics for performance reasons. If a metric has 
gauge semantics, I don't think we should represent that as a Counter.

> - alternative implementation of `gauge` which does not contend on
> master/allocator's event queue;

This is doable in some circumstances, but not always. For example, 
Master::_uptime_secs() doesn't need to run on the master queue, but 
Master::_outstanding_offers arguably does. The latter could be implemented by
sampling a variable that the master updates as offers change, but that's not
very generic, so we should try to think of something better.
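
A minimal sketch of the "sample a variable" idea, assuming a hypothetical
PushGauge class (not part of libprocess or Mesos): the owning actor updates an
atomic value as its state changes, and the metrics endpoint reads it from any
thread without dispatching onto the owner's queue.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper, not an existing libprocess type: a gauge whose
// value is pushed by its owner instead of being pulled through the
// owner's event queue.
class PushGauge
{
public:
  // Called by the owning actor when the tracked quantity grows.
  void increment()
  {
    value_.fetch_add(1, std::memory_order_relaxed);
  }

  // Called by the owning actor when the tracked quantity shrinks.
  void decrement()
  {
    std::int64_t previous = value_.fetch_sub(1, std::memory_order_relaxed);
    assert(previous > 0);  // the tracked count must never go negative
    (void) previous;       // silence unused warning when NDEBUG is set
  }

  // Called by the owning actor to overwrite the value outright.
  void set(std::int64_t value)
  {
    value_.store(value, std::memory_order_relaxed);
  }

  // Called by the metrics endpoint from any thread; never waits on
  // the owner.
  std::int64_t sample() const
  {
    return value_.load(std::memory_order_relaxed);
  }

private:
  std::atomic<std::int64_t> value_{0};
};
```

The cost is the one noted above: every code path that changes the tracked
state has to remember to update the gauge, so this only suits metrics with a
small number of well-defined update points.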

> - serving metrics collection from a different libprocess routine.

See MetricsProcess. One (mitigation?) approach would be to sample the metrics 
at a fixed rate and then serve the cached samples from the MetricsProcess. I 
expect most installations have multiple clients sampling the metrics, so this 
would at least decouple the sample rate from the metrics request rate.
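
A rough sketch of the cached-sample approach, assuming a hypothetical
SnapshotCache class (not the existing MetricsProcess API). Something outside
this sketch, such as a periodic timer, is assumed to call refresh() at the
fixed sample rate; request handlers only ever call snapshot().

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical cache, for illustration only. Samplers may be slow;
// snapshot() only copies the last set of values.
class SnapshotCache
{
public:
  using Sampler = std::function<double()>;

  // Registers a named metric and the callback that computes it.
  void add(const std::string& name, Sampler sampler)
  {
    assert(sampler != nullptr);  // an empty callback cannot be sampled
    std::lock_guard<std::mutex> lock(mutex_);
    samplers_[name] = std::move(sampler);
  }

  // Assumed to be driven by a fixed-rate timer. Samplers run outside
  // the lock so that a slow one does not block readers.
  void refresh()
  {
    std::map<std::string, Sampler> samplers;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      samplers = samplers_;
    }

    std::map<std::string, double> fresh;
    for (const auto& entry : samplers) {
      fresh[entry.first] = entry.second();
    }

    std::lock_guard<std::mutex> lock(mutex_);
    cache_ = std::move(fresh);
  }

  // Called once per metrics request; returns the last cached samples
  // without invoking any sampler.
  std::map<std::string, double> snapshot() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return cache_;
  }

private:
  mutable std::mutex mutex_;
  std::map<std::string, Sampler> samplers_;
  std::map<std::string, double> cache_;
};
```

With this shape, the number of sampler invocations depends only on how often
refresh() runs, not on how many clients poll the endpoint. The trade-off is
that clients see values up to one sample interval old.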

> 
> Any thoughts on these?
> 
> -- 
> Cheers,
> 
> Zhitao Li
