Yike Xiao created ZOOKEEPER-4767:
------------------------------------
Summary: New implementation of Prometheus quantile metrics based
on DataSketches
Key: ZOOKEEPER-4767
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4767
Project: ZooKeeper
Issue Type: Improvement
Components: metric system
Reporter: Yike Xiao
If the built-in Prometheus metrics feature introduced after version 3.6 is
enabled, under high-load scenarios (such as when there are a large number of
read requests), the percentile metrics (Summary) used to collect request
latencies can easily become a bottleneck and impact the service itself. This is
because the internal implementation of Summary involves the overhead of lock
operations. In scenarios with a large number of requests, lock contention can
lead to a dramatic deterioration in request latency. The details of this issue
and related profiling can be viewed in ZOOKEEPER-4741.
In ZOOKEEPER-4289, the updates to Summary were switched to be executed in a
separate thread pool. While this approach avoids the overhead of lock
contention caused by multiple threads updating Summary simultaneously, it
introduces the operational overhead of the thread pool queue and additional
garbage collection (GC) overhead. Especially when the thread pool queue is
full, a large number of RejectedExecutionException instances will be thrown,
further increasing the pressure on GC.
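
The rejection behaviour described above can be reproduced with a plain JDK executor. The following is a minimal sketch, not the ZooKeeper code: the class name BoundedQueueRejection, the single worker thread, and the queue capacity are all assumptions chosen for illustration. It blocks the only worker, fills the bounded queue, and counts how many further submissions fail with RejectedExecutionException.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of ZooKeeper.
final class BoundedQueueRejection {

    // Submits `attempts` blocking tasks to a 1-thread pool backed by a bounded
    // queue and returns how many submissions were rejected.
    static int countRejections(int queueCapacity, int attempts) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity)); // default policy: AbortPolicy
        int rejected = 0;
        try {
            for (int i = 0; i < attempts; i++) {
                try {
                    // Each task waits on the latch, so the worker stays busy
                    // and the queue cannot drain while we keep submitting.
                    pool.execute(() -> {
                        try {
                            release.await();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    // One exception object is allocated per rejected update.
                    rejected++;
                }
            }
        } finally {
            release.countDown(); // unblock the worker so the pool can finish
            pool.shutdown();
            try {
                pool.awaitTermination(5, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return rejected;
    }
}
```

With a queue capacity of 2 and 10 submissions, one task occupies the worker, two wait in the queue, and the remaining seven are rejected. Every rejection allocates an exception with a stack trace, which is where the extra GC pressure comes from.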
To address the problems above, I have implemented an almost lock-free solution
based on DataSketches. Benchmark results show that it offers over a 10x speed
improvement compared to version 3.9.1 and avoids frequent GC caused by creating
a large number of temporary objects. The trade-off is that the latency
percentiles are reported with a delay (60 seconds by default), and each
Summary metric carries a fixed amount of permanent memory overhead.
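
The trade-off above (delayed percentiles, fixed memory per metric) follows from a rotating-window design. The following is a minimal sketch of that general pattern under stated assumptions; it is not the patch itself. The class name RotatingQuantiles, the 1024-sample buffer, and the use of a plain long[] in place of a DataSketches sketch are all hypothetical simplifications that keep the example self-contained. Each thread records into its own buffer under a shared (read) stamp, and a periodic rotate() call swaps windows and publishes the quantiles of the drained one.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.locks.StampedLock;

// Hypothetical illustration class; not part of ZooKeeper or DataSketches.
final class RotatingQuantiles {

    // Per-thread sample buffer, standing in for a real quantile sketch.
    static final class Local {
        final StampedLock lock = new StampedLock();
        final long[] samples = new long[1024]; // fixed, permanent memory cost
        int count;
    }

    // One window; each recording thread lazily gets its own Local.
    static final class Window {
        final List<Local> all = new CopyOnWriteArrayList<>();
        final ThreadLocal<Local> local = ThreadLocal.withInitial(() -> {
            Local l = new Local();
            all.add(l);
            return l;
        });
    }

    private volatile Window current = new Window();
    private Window spare = new Window();
    private volatile long[] published = new long[0];

    // Hot path: only the owning thread writes to its Local, so the shared
    // read stamp is enough; it merely keeps rotate() out while we write.
    void record(long latency) {
        Local l = current.local.get();
        long stamp = l.lock.readLock();
        try {
            if (l.count < l.samples.length) { // samples beyond capacity are dropped
                l.samples[l.count++] = latency;
            }
        } finally {
            l.lock.unlockRead(stamp);
        }
    }

    // Called periodically (for example every 60 seconds): swap the windows,
    // then drain and publish the one that was just retired.
    synchronized void rotate() {
        Window drained = current;
        current = spare;
        spare = drained;
        long[] merged = new long[0];
        for (Local l : drained.all) {
            long stamp = l.lock.writeLock(); // exclusive: waits for in-flight record()
            try {
                int old = merged.length;
                merged = Arrays.copyOf(merged, old + l.count);
                System.arraycopy(l.samples, 0, merged, old, l.count);
                l.count = 0; // reuse the buffer instead of allocating a new one
            } finally {
                l.lock.unlockWrite(stamp);
            }
        }
        Arrays.sort(merged);
        published = merged;
    }

    // Reads the last published window only, so values lag by up to one rotation.
    double quantile(double q) {
        long[] s = published;
        if (s.length == 0) {
            return Double.NaN;
        }
        int idx = (int) Math.ceil(q * s.length) - 1;
        return s[Math.max(0, Math.min(idx, s.length - 1))];
    }
}
```

Recording the values 1 through 100 and then calling rotate() makes quantile(0.5) return 50.0; before the first rotation it returns NaN, which is the reporting delay in miniature. A real sketch would replace the long[] with a mergeable, bounded-error structure so that memory stays constant without dropping samples.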
This solution draws on Matteo Merli's optimization work on the percentile
latency metrics for BookKeeper, as detailed in
https://github.com/apache/bookkeeper/commit/3bff19956e70e37c025a8e29aa8428937af77aa1.