[
https://issues.apache.org/jira/browse/ZOOKEEPER-4767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yike Xiao updated ZOOKEEPER-4767:
---------------------------------
Labels: pull-request-available (was: )
> New implementation of prometheus quantile metrics based on DataSketches
> -----------------------------------------------------------------------
>
> Key: ZOOKEEPER-4767
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4767
> Project: ZooKeeper
> Issue Type: Improvement
> Components: metric system
> Reporter: Yike Xiao
> Priority: Major
> Labels: pull-request-available
>
> If the built-in Prometheus metrics feature introduced after version 3.6 is
> enabled, under high-load scenarios (such as when there are a large number of
> read requests), the percentile metrics (Summary) used to collect request
> latencies can easily become a bottleneck and impact the service itself. This
> is because the internal implementation of Summary involves the overhead of
> lock operations. In scenarios with a large number of requests, lock
> contention can lead to a dramatic deterioration in request latency. The
> details of this issue and related profiling can be viewed in ZOOKEEPER-4741.
> In ZOOKEEPER-4289, the updates to Summary were switched to be executed in a
> separate thread pool. While this approach avoids the overhead of lock
> contention caused by multiple threads updating Summary simultaneously, it
> introduces the operational overhead of the thread pool queue and additional
> garbage collection (GC) overhead. In particular, when the thread pool queue is
> full, a large number of RejectedExecutionException instances are thrown,
> further increasing the pressure on GC.
> To address the problems above, I have implemented an almost lock-free solution
> based on DataSketches. Benchmark results show that it offers over a 10x speed
> improvement compared to version 3.9.1 and avoids frequent GC caused by
> creating a large number of temporary objects. The trade-off is that the
> latency percentiles are reported with a delay (60 seconds by default), and
> each Summary metric carries a certain amount of permanent memory overhead.
> This solution draws on Matteo Merli's optimization work on the percentile
> latency metrics for BookKeeper, as detailed in
> https://github.com/apache/bookkeeper/commit/3bff19956e70e37c025a8e29aa8428937af77aa1.
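For readers unfamiliar with the trade-off described above (cheap, lock-free
recording in exchange for percentiles that lag by one reporting interval), the
following is a minimal sketch of the general idea only. It is not the code from
the pull request and it does not use the DataSketches API: a fixed
one-bucket-per-millisecond histogram stands in for the quantile sketch, and the
names WindowedLatencyRecorder, record, rotate and quantile are hypothetical.
It assumes a single timer thread calls rotate() once per interval.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/**
 * Illustrative only. Writers update a "current" buffer with one atomic
 * increment; a periodic rotate() publishes that buffer to readers and
 * switches writers to a spare one. The two preallocated buffers are the
 * permanent per-metric memory cost; the published snapshot is what makes
 * percentiles lag by one interval.
 */
class WindowedLatencyRecorder {
    private final int maxMillis;
    private volatile AtomicLongArray current; // written by request threads
    private AtomicLongArray spare;            // touched only inside rotate()
    private volatile long[] published;        // last completed window

    WindowedLatencyRecorder(int maxMillis) {
        this.maxMillis = maxMillis;
        this.current = new AtomicLongArray(maxMillis + 1);
        this.spare = new AtomicLongArray(maxMillis + 1);
        this.published = new long[maxMillis + 1];
    }

    /** Hot path: no lock and no allocation, values are clamped to [0, maxMillis]. */
    void record(long latencyMillis) {
        int bucket = (int) Math.max(0L, Math.min((long) maxMillis, latencyMillis));
        current.incrementAndGet(bucket);
    }

    /** Assumed to be called by one timer thread, for example every 60 seconds. */
    synchronized void rotate() {
        AtomicLongArray filled = current;
        current = spare; // new samples now go to the empty buffer
        long[] snapshot = new long[maxMillis + 1];
        for (int i = 0; i <= maxMillis; i++) {
            // Read and clear in one step; a writer that still holds the old
            // reference adds to a later window instead of losing its sample.
            snapshot[i] = filled.getAndSet(i, 0L);
        }
        spare = filled;
        published = snapshot;
    }

    /** Quantile of the last completed window, or -1 if it holds no samples. */
    long quantile(double q) {
        long[] counts = published;
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        if (total == 0) {
            return -1;
        }
        long rank = Math.max(1L, (long) Math.ceil(q * total));
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= rank) {
                return i;
            }
        }
        return maxMillis;
    }
}
```

In this sketch, samples recorded since the last rotate() are invisible to
quantile() until the next rotation, which mirrors the reporting delay mentioned
in the description. A real quantile sketch would replace the fixed histogram to
bound memory independently of the latency range.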