[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-4767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yike Xiao updated ZOOKEEPER-4767:
---------------------------------
    Labels: pull-request-available  (was: )

> New implementation of prometheus quantile metrics based on DataSketches
> -----------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4767
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4767
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: metric system
>            Reporter: Yike Xiao
>            Priority: Major
>              Labels: pull-request-available
>
> If the built-in Prometheus metrics feature introduced after version 3.6 is 
> enabled, then under high load (for example, a large number of read 
> requests) the percentile metrics (Summary) used to collect request 
> latencies can easily become a bottleneck and degrade the service itself. 
> This is because the internal implementation of Summary takes locks on 
> every update, and at a high request rate the resulting lock contention can 
> dramatically increase request latency. The details of this issue and the 
> related profiling are in ZOOKEEPER-4741.
>
> ZOOKEEPER-4289 moved the updates to Summary into a separate thread pool. 
> While this approach avoids the lock contention caused by multiple threads 
> updating Summary simultaneously, it adds the overhead of the thread pool 
> queue operations as well as additional garbage collection (GC) overhead. 
> In particular, when the thread pool queue is full, a large number of 
> RejectedExecutionException instances are thrown, which further increases 
> the pressure on GC.
>
> To address the problems above, I have implemented an almost lock-free 
> solution based on DataSketches. Benchmark results show that it is more than 
> 10x faster than version 3.9.1 and avoids the frequent GC caused by creating 
> a large number of temporary objects. The trade-off is that the latency 
> percentiles are reported with a delay (60 seconds by default), and each 
> Summary metric carries some permanent memory overhead.
>
> This solution draws on Matteo Merli's optimization work on the percentile 
> latency metrics for BookKeeper, as detailed in 
> https://github.com/apache/bookkeeper/commit/3bff19956e70e37c025a8e29aa8428937af77aa1.
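
The trade-off described in the issue (cheap, nearly uncontended updates in
exchange for delayed percentiles and a permanent per-metric buffer) can be
illustrated with a minimal Java sketch. This is not the code from the pull
request and it does not use the DataSketches library: a fixed array of
1 ms buckets stands in for a quantile sketch, and the class name
RotatingLatencySummary, the BUCKETS constant, and the methods record,
rotate, and quantile are all hypothetical. Each thread records into its own
buffer under a StampedLock read lock, which only the rotating thread ever
contends for; a periodic rotate() call (assumed to run on a timer, for
example every 60 seconds) merges and resets the buffers and publishes a
snapshot that readers query.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.locks.StampedLock;

// Illustrative only: a bucket array stands in for a DataSketches quantile sketch.
final class RotatingLatencySummary {
    // Hypothetical resolution: 1 ms buckets, values clamped to [0, BUCKETS - 1].
    private static final int BUCKETS = 1024;

    // Per-thread recorder. Only its owner thread records into it, so the
    // lock is contended only while the rotating thread drains it.
    private static final class Local {
        final StampedLock lock = new StampedLock();
        final long[] counts = new long[BUCKETS]; // permanent per-thread memory
    }

    private final List<Local> locals = new CopyOnWriteArrayList<>();
    private final ThreadLocal<Local> local = ThreadLocal.withInitial(() -> {
        Local l = new Local();
        locals.add(l);
        return l;
    });

    // Snapshot from the last rotation; readers never touch live recorders.
    private volatile long[] published = new long[BUCKETS];

    // Hot path: no allocation, no lock shared with other recording threads.
    void record(long millis) {
        Local l = local.get();
        long stamp = l.lock.readLock();
        try {
            l.counts[(int) Math.min(Math.max(millis, 0), BUCKETS - 1)]++;
        } finally {
            l.lock.unlockRead(stamp);
        }
    }

    // Assumed to be called periodically by a scheduler (not shown here).
    void rotate() {
        long[] merged = new long[BUCKETS];
        for (Local l : locals) {
            long stamp = l.lock.writeLock();
            try {
                for (int i = 0; i < BUCKETS; i++) {
                    merged[i] += l.counts[i];
                    l.counts[i] = 0; // start a fresh interval
                }
            } finally {
                l.lock.unlockWrite(stamp);
            }
        }
        published = merged;
    }

    // Returns the bucket (in ms) at quantile q of the last published
    // interval, or -1 if that interval contained no samples.
    long quantile(double q) {
        long[] snap = published;
        long total = 0;
        for (long c : snap) {
            total += c;
        }
        if (total == 0) {
            return -1;
        }
        long rank = Math.max(1, (long) Math.ceil(q * total));
        long seen = 0;
        for (int i = 0; i < BUCKETS; i++) {
            seen += snap[i];
            if (seen >= rank) {
                return i;
            }
        }
        return BUCKETS - 1;
    }
}
```

In this sketch, samples recorded since the last rotate() call are invisible
to quantile(), which mirrors the reporting delay mentioned in the issue. A
real quantile sketch would replace the bucket array to bound the error
without fixing the value range in advance.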



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
