> On May 19, 2017, at 11:35 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
>
> Hi,
>
> I'd like to start a conversation to talk about metrics collection endpoints
> (especially `/metrics/snapshot`) behavior.
>
> Right now, these endpoints are served from the same master/agent's
> libprocess, and extensively uses `gauge` to chain further callbacks to
> collect various metrics (DRF allocator specifically adds several metrics
> per role).
>
> This brings a problem when the system is under load: when the
> master/allocator libprocess becomes busy, stats collection itself becomes
> slow too. Flying dark when the system is under load is specifically painful
> for an operator.
Yes, sampling metrics should approach zero cost.

> I would like to explore the direction of isolating metric collection even
> when the master is slow. A couple of ideas:
>
> - (short term) reduce usage of gauge and prefer counter (since I believe
> they are less affected);

I'd rather not squash the semantics for performance reasons. If a metric has gauge semantics, I don't think we should represent that as a Counter.

> - alternative implementation of `gauge` which does not contend on
> master/allocator's event queue;

This is doable in some circumstances, but not always. For example, Master::_uptime_secs() doesn't need to run on the master queue, but Master::_outstanding_offers() arguably does. The latter could be implemented by sampling a variable that is updated, but that's not very generic, so we should try to think of something better.

> - serving metrics collection from a different libprocess routine.

See MetricsProcess. One (mitigation?) approach would be to sample the metrics at a fixed rate and then serve the cached samples from the MetricsProcess. I expect most installations have multiple clients sampling the metrics, so this would at least decouple the sample rate from the metrics request rate.

>
> Any thoughts on these?
>
> --
> Cheers,
>
> Zhitao Li
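
To make the fixed-rate sampling idea concrete, here is a rough sketch in plain Python rather than libprocess. It is only an illustration of the decoupling, not how MetricsProcess works today: `CachedMetrics`, `sample_once` and `snapshot` are hypothetical names, and the gauge callback stands in for anything that would have to run on the master's queue.

```python
import threading
from typing import Callable, Dict


class CachedMetrics:
    """Samples gauges at a fixed rate and serves the last cached snapshot.

    Hypothetical helper for illustration; not part of Mesos or libprocess.
    """

    def __init__(self, gauges: Dict[str, Callable[[], float]],
                 interval_secs: float) -> None:
        self._gauges = gauges
        self._interval_secs = interval_secs
        self._lock = threading.Lock()
        self._cache: Dict[str, float] = {}
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def sample_once(self) -> None:
        # Evaluate every gauge outside the lock (this is the slow part),
        # then swap in the fresh values under the lock.
        fresh = {name: fn() for name, fn in self._gauges.items()}
        with self._lock:
            self._cache = fresh

    def _run(self) -> None:
        # Sample, then sleep for the interval or until stop() is called.
        while not self._stop.is_set():
            self.sample_once()
            self._stop.wait(self._interval_secs)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def snapshot(self) -> Dict[str, float]:
        # Requests only copy the cache; they never invoke a gauge.
        with self._lock:
            return dict(self._cache)
```

With this shape, the cost on the busy actor is bounded by the sampling interval regardless of how many clients poll the endpoint. The trade-off is that a snapshot can be up to one interval stale, and if the sampler itself stalls the endpoint keeps returning old values, so a real version would probably want to expose the sample timestamp as well.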