[ https://issues.apache.org/jira/browse/ZOOKEEPER-4767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yike Xiao updated ZOOKEEPER-4767:
---------------------------------
    Labels: pull-request-available  (was: )

> New implementation of prometheus quantile metrics based on DataSketches
> -----------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4767
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4767
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: metric system
>            Reporter: Yike Xiao
>            Priority: Major
>              Labels: pull-request-available
>
> If the built-in Prometheus metrics feature introduced after version 3.6 is
> enabled, under high-load scenarios (such as when there are a large number of
> read requests), the percentile metrics (Summary) used to collect request
> latencies can easily become a bottleneck and impact the service itself. This
> is because the internal implementation of Summary involves the overhead of
> lock operations. In scenarios with a large number of requests, lock
> contention can lead to a dramatic deterioration in request latency. The
> details of this issue and related profiling can be viewed in ZOOKEEPER-4741.
> In ZOOKEEPER-4289, the updates to Summary were switched to be executed in a
> separate thread pool. While this approach avoids the overhead of lock
> contention caused by multiple threads updating Summary simultaneously, it
> introduces the operational overhead of the thread pool queue and additional
> garbage collection (GC) overhead. In particular, when the thread pool queue
> is full, a large number of RejectedExecutionException instances are thrown,
> further increasing the pressure on GC.
> To address the problems above, I have implemented an almost lock-free
> solution based on DataSketches. Benchmark results show that it offers over a
> 10x speed improvement compared to version 3.9.1 and avoids the frequent GC
> caused by creating a large number of temporary objects. The trade-off is that
> the latency percentiles will be displayed with a relative delay (default is
> 60 seconds), and each Summary metric will have a certain amount of permanent
> memory overhead.
> This solution refers to Matteo Merli's optimization work on the percentile
> latency metrics for BookKeeper, as detailed in
> https://github.com/apache/bookkeeper/commit/3bff19956e70e37c025a8e29aa8428937af77aa1.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
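
For illustration only, the trade-off described in the issue (cheap recording on the request path, percentiles published with a delay) could be sketched roughly as below. This is not the code from the pull request. The class name RotatingSummary and its methods are hypothetical, the per-thread recording design is an assumption, and a plain per-thread list stands in for a DataSketches quantile sketch so that the example needs only the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch: per-thread recording plus periodic rotation.
// A plain list stands in for a DataSketches quantile sketch (assumption).
final class RotatingSummary {

    // Per-thread buffer. Its monitor is only contended when rotate()
    // swaps the buffer, so recording threads do not block each other.
    private static final class LocalBuffer {
        private List<Double> current = new ArrayList<>();

        synchronized void add(double value) {
            current.add(value);
        }

        // Hands back everything recorded so far and starts a fresh buffer.
        synchronized List<Double> swap() {
            List<Double> old = current;
            current = new ArrayList<>();
            return old;
        }
    }

    // Every thread's buffer, so rotate() can reach all of them.
    // Buffers of finished threads stay registered in this simplified version.
    private final List<LocalBuffer> buffers = new CopyOnWriteArrayList<>();

    private final ThreadLocal<LocalBuffer> local = ThreadLocal.withInitial(() -> {
        LocalBuffer buffer = new LocalBuffer();
        buffers.add(buffer);
        return buffer;
    });

    // Sorted values of the last completed interval; readers only see this.
    private volatile double[] snapshot = new double[0];

    // Called on the request path.
    void observe(double value) {
        local.get().add(value);
    }

    // Called periodically (the issue mentions a 60 second default delay).
    void rotate() {
        List<Double> merged = new ArrayList<>();
        for (LocalBuffer buffer : buffers) {
            merged.addAll(buffer.swap());
        }
        snapshot = merged.stream().mapToDouble(Double::doubleValue).sorted().toArray();
    }

    // Nearest-rank quantile over the last published interval.
    double quantile(double q) {
        double[] values = snapshot;
        if (values.length == 0) {
            return Double.NaN;
        }
        int index = (int) Math.ceil(q * values.length) - 1;
        return values[Math.max(0, Math.min(values.length - 1, index))];
    }
}
```

In a real deployment rotate() would presumably be driven by a scheduler at the configured interval, which is where the delay in the displayed percentiles comes from. Replacing the list with a fixed-size sketch would bound the memory per metric at the cost of approximate quantiles.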