Github user jtuple commented on the issue:
https://github.com/apache/zookeeper/pull/580
I'm happy to add tests for `AvgMinMaxCounter` and `SimpleCounter`; I'll go
ahead and update this PR today to include them for completeness, since it's an
easy change to make while we discuss any other reworkings.
We actually landed a version of #440 internally at Facebook and have been
running with Dropwizard-based percentile counters in production for a few
months now. We build on the same `ServerMetrics` design as this PR and have an
`AvgMinMaxPercentileCounter` that wraps a Dropwizard `Histogram`, using a
custom uniform-sampling reservoir implementation that allows the
metric/reservoir to be reset. This keeps it consistent with all the other
metric types, which are resettable.
In practice, the memory overhead of the Dropwizard histogram is much higher
than that of the flat `AvgMinMaxCounter` metric, so we only converted about 20
of our 100 metrics to `AvgMinMaxPercentileCounter` -- essentially just the
latency metrics and the metrics that track time spent in various queues and
stages of the request pipeline.
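To make the reset-on-read behavior concrete, here is a minimal sketch of a
uniform-sampling reservoir (Vitter's Algorithm R) with a cheap `reset()`, the
kind of thing a resettable histogram wrapper would plug in. The class and
method names are illustrative assumptions, not the actual patch, and the real
version would implement Dropwizard's `Reservoir` interface:

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: a fixed-size uniform-sampling reservoir that can be reset.
class ResettableUniformReservoir {
    private static final int DEFAULT_SIZE = 1024;

    private final AtomicLong count = new AtomicLong();
    private volatile long[] values = new long[DEFAULT_SIZE];

    void update(long value) {
        long c = count.incrementAndGet();
        if (c <= values.length) {
            // still filling the reservoir: take every sample
            values[(int) c - 1] = value;
        } else {
            // reservoir full: keep the new sample with probability size/c
            long r = ThreadLocalRandom.current().nextLong(c);
            if (r < values.length) {
                values[(int) r] = value;
            }
        }
    }

    int size() {
        return (int) Math.min(count.get(), values.length);
    }

    // Reset by swapping in a fresh array rather than zeroing in place,
    // so concurrent readers never observe a half-cleared reservoir.
    void reset() {
        count.set(0);
        values = new long[DEFAULT_SIZE];
    }

    long[] snapshot() {
        return Arrays.copyOf(values, size());
    }
}
```

The swap-on-reset trick is what makes the reset fast: it is a single reference
store instead of an O(size) clear on the hot path.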
We also have `AvgMinMaxCounterSet` and `AvgMinMaxPercentileCounterSet`
metric types that aggregate metrics across a dynamic set of entities; we were
planning to submit these in future PRs building on this one. Example uses
include tracking learner queue size/time and ack latency on the leader for
each quorum peer.
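For a sense of what the set variant looks like, here is a hedged sketch of an
`AvgMinMaxCounterSet`-style type: one avg/min/max counter per dynamically
discovered entity (e.g. a quorum peer id), created lazily on first update. The
method names and structure are assumptions for illustration, not the code we
plan to submit:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: aggregates avg/min/max per entity in a dynamic set.
class AvgMinMaxCounterSet {
    static class AvgMinMaxCounter {
        private long count;
        private long total;
        private long min = Long.MAX_VALUE;
        private long max = Long.MIN_VALUE;

        synchronized void add(long v) {
            count++;
            total += v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }

        synchronized double avg() {
            return count == 0 ? 0.0 : (double) total / count;
        }

        synchronized long max() {
            return count == 0 ? 0 : max;
        }
    }

    private final String name;
    private final ConcurrentHashMap<String, AvgMinMaxCounter> counters =
            new ConcurrentHashMap<>();

    AvgMinMaxCounterSet(String name) {
        this.name = name;
    }

    // Entities (e.g. quorum peers) appear dynamically; create on first use.
    void add(String entity, long value) {
        counters.computeIfAbsent(entity, k -> new AvgMinMaxCounter()).add(value);
    }

    double avg(String entity) {
        AvgMinMaxCounter c = counters.get(entity);
        return c == null ? 0.0 : c.avg();
    }
}
```

Usage would look like `ackLatency.add("peer-2", elapsedMs)` from the leader's
ack-processing path, with one counter materialized per peer.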
---
In any case, I'm happy to rework this PR with guidance, and I have the
bandwidth to do so if we can come to a consensus on a design. My biggest concern is that
pretty much all of our remaining internal features at Facebook that we want to
upstream are blocked on this feature, since we aggressively add metrics when
adding new features (so that we can monitor/track in production, etc). The
sooner we can land something, the sooner we can rebase on top of whatever the
agreed approach is and unblock our other contributions.
I'm open to any design the community is happy with; however, I think
there are a few things that are important to maintain:
1. Any solution should make it super easy to add new metrics and use them
throughout the codebase. We've learned the hard way that lowering the barrier
to entry was the only way to truly motivate adding new metrics while adding new
features.
2. It's useful to have a `Metric`-type interface that allows us to enforce
certain requirements. For example, ZooKeeper metrics have historically been
resettable, which isn't necessarily true of other metric libraries. Having an
adaptor class allowed us to wrap the Dropwizard histograms and add a fast
reset capability so they behaved like all our other metrics.
3. Even if we export to external systems, having ZooKeeper maintain
internal metrics/aggregation is still important. There should be a low
barrier to entry for getting visibility into the system from day one, with
minimal config/setup.
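To illustrate point 2, here is a minimal sketch of the kind of common `Metric`
interface described there: a contract that forces every metric type, including
adaptor-wrapped third-party histograms, to support `reset()`. The interface
shape and the `SimpleCounter` body are illustrative assumptions, not the
actual code in this PR:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a common contract every metric type must satisfy.
interface Metric {
    void add(long value);

    // ZooKeeper metrics have historically been resettable; putting reset()
    // on the interface enforces that for wrapped third-party types too.
    void reset();

    // Name -> current value pairs, for reporting/export.
    Map<String, Object> values();
}

// A trivial implementation showing the contract in use.
class SimpleCounter implements Metric {
    private final String name;
    private long total;

    SimpleCounter(String name) {
        this.name = name;
    }

    public synchronized void add(long value) {
        total += value;
    }

    public synchronized void reset() {
        total = 0;
    }

    public synchronized Map<String, Object> values() {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put(name, total);
        return m;
    }
}
```

An adaptor over a non-resettable histogram can then satisfy the same
interface by swapping in a fresh histogram instance on `reset()`.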
---