Github user jtuple commented on the issue:
https://github.com/apache/zookeeper/pull/580
I'm happy to add tests for `AvgMinMaxCounter` and `SimpleCounter`; I'll go
ahead and update this PR today to include them for completeness, since it's an
easy change to make while we discuss any other reworkings.
We actually landed a version of #440 internally at Facebook and have been
running with Dropwizard-based percentile counters in production for a few
months now. We build on the same `ServerMetrics` design as this PR and have an
`AvgMinMaxPercentileCounter` that wraps a Dropwizard `Histogram`, using a
custom uniform-sampling reservoir implementation that allows the
metric/reservoir to be reset. This keeps it consistent with all the other
metric types, which are resettable.
In practice, the memory overhead of the Dropwizard histogram is much higher
than that of the flat `AvgMinMaxCounter` metric, so we only converted about 20
of our 100 metrics to `AvgMinMaxPercentileCounter` -- essentially just the
latency metrics and the metrics that track time spent in various queues and
stages of the request pipeline.
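To make the reset-on-read behavior concrete, here is a minimal sketch of a
uniform-sampling reservoir (Vitter's Algorithm R) with a cheap `reset()`, the
kind of thing a resettable histogram wrapper would plug in. The class and
method names are illustrative assumptions, not the actual patch, and the real
version would implement Dropwizard's `Reservoir` interface:

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: a fixed-size uniform-sampling reservoir that can be reset.
class ResettableUniformReservoir {
    private static final int DEFAULT_SIZE = 1024;

    private final AtomicLong count = new AtomicLong();
    private volatile long[] values = new long[DEFAULT_SIZE];

    void update(long value) {
        long c = count.incrementAndGet();
        if (c <= values.length) {
            // still filling the reservoir: take every sample
            values[(int) c - 1] = value;
        } else {
            // reservoir full: keep the new sample with probability size/c
            long r = ThreadLocalRandom.current().nextLong(c);
            if (r < values.length) {
                values[(int) r] = value;
            }
        }
    }

    int size() {
        return (int) Math.min(count.get(), values.length);
    }

    // Reset by swapping in a fresh array rather than zeroing in place,
    // so concurrent readers never observe a half-cleared reservoir.
    void reset() {
        count.set(0);
        values = new long[DEFAULT_SIZE];
    }

    long[] snapshot() {
        return Arrays.copyOf(values, size());
    }
}
```

The swap-on-reset trick is what makes the reset fast: it is a single reference
store instead of an O(size) clear on the hot path.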
We also have `AvgMinMaxCounterSet` and `AvgMinMaxPercentileCounterSet`
metric types that aggregate metrics across a dynamic set of entities; we were
planning to submit these in future PRs building on this one. Example uses
include tracking learner queue size/time and ack latency on the leader for
each quorum peer.
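For a sense of what the set variant looks like, here is a hedged sketch of an
`AvgMinMaxCounterSet`-style type: one avg/min/max counter per dynamically
discovered entity (e.g. a quorum peer id), created lazily on first update. The
method names and structure are assumptions for illustration, not the code we
plan to submit:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: aggregates avg/min/max per entity in a dynamic set.
class AvgMinMaxCounterSet {
    static class AvgMinMaxCounter {
        private long count;
        private long total;
        private long min = Long.MAX_VALUE;
        private long max = Long.MIN_VALUE;

        synchronized void add(long v) {
            count++;
            total += v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }

        synchronized double avg() {
            return count == 0 ? 0.0 : (double) total / count;
        }

        synchronized long max() {
            return count == 0 ? 0 : max;
        }
    }

    private final String name;
    private final ConcurrentHashMap<String, AvgMinMaxCounter> counters =
            new ConcurrentHashMap<>();

    AvgMinMaxCounterSet(String name) {
        this.name = name;
    }

    // Entities (e.g. quorum peers) appear dynamically; create on first use.
    void add(String entity, long value) {
        counters.computeIfAbsent(entity, k -> new AvgMinMaxCounter()).add(value);
    }

    double avg(String entity) {
        AvgMinMaxCounter c = counters.get(entity);
        return c == null ? 0.0 : c.avg();
    }
}
```

Usage would look like `ackLatency.add("peer-2", elapsedMs)` from the leader's
ack-processing path, with one counter materialized per peer.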
---
In any case, I'm happy to rework this PR with guidance, and I have the
bandwidth to do so if we can come to a consensus on a design. My biggest concern is that
pretty much all of our remaining internal features at Facebook that we want to
upstream are blocked on this feature, since we aggressively add metrics when
adding new features (so that we can monitor/track in production, etc). The
sooner we can land something, the sooner we can rebase on top of whatever the
agreed approach is and unblock our other contributions.
I'm open to any design the community is happy with; however, I think
there are a few things that are important to maintain:
1. Any solution should make it super easy to add new metrics and use them
throughout the codebase. We've learned the hard way that lowering the barrier
to entry was the only way to truly motivate adding new metrics while adding new
features.
2. It's useful to have a `Metric`-type interface that allows us to enforce
certain requirements. For example, ZooKeeper metrics have historically been
resettable, which isn't necessarily true of other metric libraries. Having an
adaptor class allowed us to wrap the Dropwizard histograms and add a fast
reset capability so they behaved like all our other metrics.
3. Even if we export to external systems, having ZooKeeper maintain
internal metrics/aggregation is still important. There should be a low
barrier to entry for getting visibility into the system from day one, with
minimal config/setup.
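To illustrate point 2, here is a minimal sketch of the kind of common `Metric`
interface described there: a contract that forces every metric type, including
adaptor-wrapped third-party histograms, to support `reset()`. The interface
shape and the `SimpleCounter` body are illustrative assumptions, not the
actual code in this PR:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a common contract every metric type must satisfy.
interface Metric {
    void add(long value);

    // ZooKeeper metrics have historically been resettable; putting reset()
    // on the interface enforces that for wrapped third-party types too.
    void reset();

    // Name -> current value pairs, for reporting/export.
    Map<String, Object> values();
}

// A trivial implementation showing the contract in use.
class SimpleCounter implements Metric {
    private final String name;
    private long total;

    SimpleCounter(String name) {
        this.name = name;
    }

    public synchronized void add(long value) {
        total += value;
    }

    public synchronized void reset() {
        total = 0;
    }

    public synchronized Map<String, Object> values() {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put(name, total);
        return m;
    }
}
```

An adaptor over a non-resettable histogram can then satisfy the same
interface by swapping in a fresh histogram instance on `reset()`.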
---