Hi Benjamin,

Thanks for the response. I agree that we should get started on all four options as early as we can.
The reason I'm pushing for option 5) is that we probably won't get everything fixed within 1-2 minor release cycles, and I badly need a way to work around the problem. Also, with a sampled metrics routine, we have a chance to measure how long the metrics collection cycle itself takes, which gives us a good starting point for quantifying the entire problem we are dealing with.

If you don't mind, I'll merge the conversation from the other thread into this one, file issues, and try to find time to work on them.

On Fri, May 26, 2017 at 7:42 PM, Benjamin Mahler <bmah...@apache.org> wrote:

> This is great, I would love to see the metrics collection be low latency.
>
> Here is the response I gave last time, with the 4 options I think we
> should tackle:
> https://www.mail-archive.com/dev@mesos.apache.org/msg37113.html
>
> IMHO we should defer looking into the 5th option of sampling/caching
> internally until we've thoroughly attempted to avoid tripping through
> the Process queue, because if we succeed it will become unnecessary.
>
> On Mon, May 22, 2017 at 1:26 PM, Zhitao Li <zhitaoli...@gmail.com> wrote:
>
> > Thanks for the feedback, James.
> >
> > Replying to your points inline:
> >
> > On Mon, May 22, 2017 at 10:56 AM, James Peach <jor...@gmail.com> wrote:
> >
> > > > On May 19, 2017, at 11:35 AM, Zhitao Li <zhitaoli...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > I'd like to start a conversation about the behavior of the metrics
> > > > collection endpoints (especially `/metrics/snapshot`).
> > > >
> > > > Right now, these endpoints are served from the same master/agent
> > > > libprocess, and extensively use `gauge` to chain further callbacks
> > > > to collect various metrics (the DRF allocator specifically adds
> > > > several metrics per role).
> > > >
> > > > This brings a problem when the system is under load: when the
> > > > master/allocator libprocess becomes busy, stats collection itself
> > > > becomes slow too.
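(To make concrete what I mean by a "sampled metric routine" above, here is a rough sketch. The class and function names are hypothetical illustrations, not actual Mesos code: a background routine refreshes a cached snapshot at a fixed rate, so the HTTP handler never has to dispatch through the busy master/allocator queue.)

```cpp
// Minimal sketch (hypothetical names, NOT Mesos code): a metrics cache
// refreshed on a fixed interval, so serving '/metrics/snapshot' only
// reads the last cached sample instead of collecting metrics inline.
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

class CachedMetrics {
public:
  using Snapshot = std::map<std::string, double>;

  // 'collect' is the (potentially slow) routine that gathers all metrics.
  explicit CachedMetrics(std::function<Snapshot()> collect)
    : collect_(std::move(collect)) {}

  // Called by a timer or background actor at a fixed rate (e.g. every 5s).
  // The slow collection runs here, off the request-serving hot path.
  void refresh() {
    Snapshot fresh = collect_();
    std::lock_guard<std::mutex> lock(mutex_);
    cached_ = std::move(fresh);
  }

  // Called by the endpoint handler: cheap lock, copy, return. The result
  // may be up to one refresh interval stale, which is the trade-off.
  Snapshot snapshot() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return cached_;
  }

private:
  std::function<Snapshot()> collect_;
  mutable std::mutex mutex_;
  Snapshot cached_;
};
```

This also gives us the measurement hook I mentioned: timing each `refresh()` call tells us exactly how long one full collection cycle takes.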
> > > > Flying dark when the system is under load is especially painful
> > > > for an operator.
> > >
> > > Yes, sampling metrics should approach zero cost.
> > >
> > > > I would like to explore the direction of isolating metrics
> > > > collection even when the master is slow. A couple of ideas:
> > > >
> > > > - (short term) reduce usage of gauge and prefer counter (since I
> > > > believe they are less affected);
> > >
> > > I'd rather not squash the semantics for performance reasons. If a
> > > metric has gauge semantics, I don't think we should represent that
> > > as a Counter.
> >
> > I recall from a previous conversation with @bmahler that he thought
> > certain gauges could be expressed as the differential of two counters.
> >
> > We definitely cannot express a gauge as a Counter, because a gauge
> > value can decrease, while a counter should always be treated as
> > monotonically increasing until process restart.
> >
> > > > - alternative implementation of `gauge` which does not contend on
> > > > master/allocator's event queue;
> > >
> > > This is doable in some circumstances, but not always. For example,
> > > Master::_uptime_secs() doesn't need to run on the master queue, but
> > > Master::_outstanding_offers arguably does. The latter could be
> > > implemented by sampling a variable that is updated, but that's not
> > > very generic, so we should try to think of something better.
> >
> > I agree that this will not be a trivial cut. The fact that this is
> > not trivially achievable is the primary reason I want to start this
> > conversation with the general dev community. We can absorb some work
> > to optimize certain hot paths (I suspect the role-specific ones in
> > the allocator are among them for us), but maintaining this in the
> > long term will definitely require all contributors to help.
> >
> > w.r.t. examples, it seems that Master::_outstanding_offers simply
> > calls hashmap::size() on a hashmap object, so if the underlying
> > container type conforms to C++11's thread safety requirements
> > (http://en.cppreference.com/w/cpp/container#Thread_safety), we should
> > at least be able to call the size() function with the understanding
> > that we might get a slightly stale value?
> >
> > I think a more interesting example is Master::_task_starting(): not
> > only is it not calculated from a simple const method, but the result
> > is actually generated by iterating over all tasks registered to the
> > master. This means the cost of calculating it is linear in the number
> > of tasks in the cluster.
> >
> > > > - serving metrics collection from a different libprocess routine.
> > >
> > > See MetricsProcess. One (mitigation?) approach would be to sample
> > > the metrics at a fixed rate and then serve the cached samples from
> > > the MetricsProcess. I expect most installations have multiple
> > > clients sampling the metrics, so this would at least decouple the
> > > sample rate from the metrics request rate.
> >
> > That sounds like a very good idea to start with. I still think
> > certain code should be augmented to maintain gauge values more
> > efficiently, but for whatever code is harder to rewrite, this can
> > definitely improve the situation.
> >
> > > > Any thoughts on these?
> > > >
> > > > --
> > > > Cheers,
> > > >
> > > > Zhitao Li
> >
> > --
> > Cheers,
> >
> > Zhitao Li

--
Cheers,

Zhitao Li
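P.S. To make the "differential of two counters" idea quoted above concrete, here is a minimal sketch (hypothetical names, not actual Mesos code) of how something like a running-tasks gauge could be derived from two monotonically increasing counters bumped on the master's normal code paths. Reading the gauge is then two atomic loads: no dispatch onto the master's event queue and no iteration that is linear in the number of tasks.

```cpp
// Minimal sketch (hypothetical names, NOT Mesos code): a gauge expressed
// as the difference of two monotonic counters, readable without touching
// the master's event queue.
#include <atomic>
#include <cstdint>

class RunningTasksGauge {
public:
  // Bumped wherever the master transitions a task into or out of the
  // running state, on its normal (already serialized) code paths.
  void onTaskStarted()  { started_.fetch_add(1, std::memory_order_relaxed); }
  void onTaskFinished() { finished_.fetch_add(1, std::memory_order_relaxed); }

  // The gauge value. It may be momentarily stale, which is acceptable
  // for metrics, but it never blocks on the master's queue.
  int64_t value() const {
    // Load 'finished' first: if a task starts and finishes between the
    // two loads, the result can only be an overestimate, never negative.
    const int64_t finished = finished_.load(std::memory_order_relaxed);
    const int64_t started = started_.load(std::memory_order_relaxed);
    return started - finished;
  }

private:
  std::atomic<int64_t> started_{0};
  std::atomic<int64_t> finished_{0};
};
```

Note this preserves the semantics James was worried about: the exposed value behaves as a gauge (it can go down), while the underlying counters stay monotonic until process restart.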