Thanks for the feedback, James. Replying to your points inline:
On Mon, May 22, 2017 at 10:56 AM, James Peach <jor...@gmail.com> wrote:
>
> > On May 19, 2017, at 11:35 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
> >
> > Hi,
> >
> > I'd like to start a conversation to talk about metrics collection
> > endpoints (especially `/metrics/snapshot`) behavior.
> >
> > Right now, these endpoints are served from the same master/agent's
> > libprocess, and extensively use `gauge` to chain further callbacks to
> > collect various metrics (the DRF allocator specifically adds several
> > metrics per role).
> >
> > This brings a problem when the system is under load: when the
> > master/allocator libprocess becomes busy, stats collection itself
> > becomes slow too. Flying dark when the system is under load is
> > especially painful for an operator.
>
> Yes, sampling metrics should approach zero cost.
>
> > I would like to explore the direction of isolating metric collection
> > even when the master is slow. A couple of ideas:
> >
> > - (short term) reduce usage of gauge and prefer counter (since I
> >   believe they are less affected);
>
> I'd rather not squash the semantics for performance reasons. If a metric
> has gauge semantics, I don't think we should represent that as a Counter.

I recall from a previous conversation with @bmahler that he thought certain
gauges could be expressed as the differential of two counters. We definitely
cannot express a gauge as a Counter directly, because a gauge value can
decrease, while a counter should always be treated as monotonically
increasing until the process restarts. (I put a rough sketch of the
two-counter idea further down in this email.)

> > - alternative implementation of `gauge` which does not contend on
> >   master/allocator's event queue;
>
> This is doable in some circumstances, but not always. For example,
> Master::_uptime_secs() doesn't need to run on the master queue, but
> Master::_outstanding_offers arguably does. The latter could be implemented
> by sampling a variable that is updated, but that's not very generic, so we
> should try to think of something better.

I agree that this will not be a trivial change. The fact that this is not
trivially achievable is the primary reason I want to start this conversation
with the general dev community. We can absorb some of the work to optimize
certain hot paths (I suspect the role-specific ones in the allocator are
among them for us), but maintaining this in the long term will definitely
require help from all contributors.

W.r.t. the examples: it seems that Master::_outstanding_offers simply calls
hashmap::size() on a hashmap object, so if the underlying container type
conforms to C++11's thread safety requirements
(http://en.cppreference.com/w/cpp/container#Thread_safety), we should at
least be able to call size() with the understanding that we might get a
slightly stale value?

I think a more interesting example is Master::_task_starting(): not only is
it not calculated by a simple const method, but the result is actually
generated by iterating over all tasks registered with the master. This means
the cost of calculating it is linear in the number of tasks in the cluster.

> > - serving metrics collection from a different libprocess routine.
>
> See MetricsProcess. One (mitigation?) approach would be to sample the
> metrics at a fixed rate and then serve the cached samples from the
> MetricsProcess. I expect most installations have multiple clients sampling
> the metrics, so this would at least decouple the sample rate from the
> metrics request rate.

That sounds like a very good idea to start with.
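To make sure I understand the suggestion, below is a rough, standalone
sketch of the "sample at a fixed rate and serve the cached samples" idea.
It is plain C++ with an ordinary thread rather than actual
MetricsProcess/libprocess code, and all of the names (CachedMetrics etc.)
are made up purely for illustration:

    #include <chrono>
    #include <functional>
    #include <iostream>
    #include <map>
    #include <mutex>
    #include <string>
    #include <thread>

    // Hypothetical cache: the expensive gauge callbacks only run inside
    // sample(), which a background loop calls at a fixed interval. The
    // endpoint handler only reads the cached values.
    class CachedMetrics {
    public:
      void add(const std::string& name, std::function<double()> gauge) {
        std::lock_guard<std::mutex> lock(mutex_);
        gauges_[name] = std::move(gauge);
      }

      // The only place the (possibly slow) callbacks run.
      void sample() {
        std::lock_guard<std::mutex> lock(mutex_);
        for (const auto& entry : gauges_) {
          cache_[entry.first] = entry.second();
        }
      }

      // What '/metrics/snapshot' would serve: the cost is independent of
      // how busy the master/allocator is, at the price of bounded staleness.
      std::map<std::string, double> snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return cache_;
      }

    private:
      mutable std::mutex mutex_;
      std::map<std::string, std::function<double()>> gauges_;
      std::map<std::string, double> cache_;
    };

    int main() {
      CachedMetrics metrics;

      metrics.add("master/outstanding_offers", []() {
        // Imagine this dispatching onto the master queue and being slow
        // when the master is under load.
        return 42.0;
      });

      // Fixed-rate sampler, decoupled from how often clients hit the endpoint.
      std::thread sampler([&metrics]() {
        for (int i = 0; i < 3; i++) {
          metrics.sample();
          std::this_thread::sleep_for(std::chrono::seconds(1));
        }
      });
      sampler.join();

      for (const auto& entry : metrics.snapshot()) {
        std::cout << entry.first << " = " << entry.second << std::endl;
      }
      return 0;
    }

With something like this, every additional client scraping
/metrics/snapshot only pays for reading the cache, and the worst-case
staleness is bounded by the sampling interval.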
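To also make the two-counter idea from above concrete, here is an equally
rough sketch applied to something like the task-starting count: the task
state transition paths bump two monotonically increasing counters, and the
gauge value is simply their difference, so reading it is O(1) and does not
iterate over all tasks or touch the master's queue. Again, the names are
made up and this is not the existing Mesos metrics code:

    #include <atomic>
    #include <cstdint>
    #include <iostream>

    // Two monotonically increasing counters, updated at the points where a
    // task enters or leaves TASK_STARTING. Both only ever go up, so they
    // keep counter semantics; their difference behaves like the gauge.
    std::atomic<uint64_t> tasks_entered_starting{0};
    std::atomic<uint64_t> tasks_left_starting{0};

    // Called from the task state transition paths: the hot path pays one
    // atomic increment, and nothing is recomputed at collection time.
    void onTaskStarting() { tasks_entered_starting++; }
    void onTaskLeftStarting() { tasks_left_starting++; }

    // What the gauge would report.
    uint64_t tasksStarting() {
      // Loading "left" before "entered" avoids a transient underflow; the
      // value may be slightly stale, which is acceptable for metrics.
      uint64_t left = tasks_left_starting.load();
      uint64_t entered = tasks_entered_starting.load();
      return entered - left;
    }

    int main() {
      onTaskStarting();
      onTaskStarting();
      onTaskLeftStarting();
      std::cout << "tasks_starting = " << tasksStarting() << std::endl; // 1
      return 0;
    }

The obvious trade-off is that every relevant state transition has to
remember to update the counters, which is exactly the kind of maintenance
burden I mentioned above.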
I still think certain code should be augmented so that it maintains the
gauge value more efficiently (along the lines of the two-counter sketch
above), but for code that is harder to rewrite, the fixed-rate sampling
approach can definitely improve the situation.

> > Any thoughts on these?
> >
> > --
> > Cheers,
> >
> > Zhitao Li

--
Cheers,
Zhitao Li