Hi Community,

I've done further investigation on the issue and found the following:

1. Read performance decreased because of lock contention in the Prometheus
   TimeWindowQuantiles API. 3 out of 4 CommitProcWorker threads were blocked
   on TimeWindowQuantiles.insert() while the test was running.

2. Write performance decreased because of the high CPU usage of the
   Prometheus Summary metric type. The CPU usage of CommitProcessor increased
   by about 50% when Prometheus was enabled compared to disabled (46% disabled
   vs 80% enabled with 4 CPUs, 63% vs 99% with 12 CPUs).

Prometheus integration is a great feature; however, the negative performance
impact is very significant. I wonder if anyone has any thoughts on how to
reduce it.

Thanks,

Li

On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li4w...@gmail.com> wrote:

> Hi,
>
> I would like to reach out to the community to see if anyone has some
> insights or experience with the performance impact of enabling Prometheus
> metrics.
>
> I have done load comparison tests for Prometheus enabled vs disabled and
> found that performance is reduced by about 40%-60% for both read and write
> operations (i.e. getData, getChildren and createNode).
>
> The load test was done with ZooKeeper 3.7, a cluster of 5 participants
> and 5 observers, each ZK server with a 10G heap and 4 CPUs, and 500
> concurrent users sending requests.
>
> The performance impact is quite significant. I wonder if this is expected
> and what we can do to keep ZK performing the same while leveraging the new
> Prometheus metrics feature.
>
> Best,
>
> Li
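
P.S. For anyone who wants to look at the pattern from point 1 in isolation, here is a minimal sketch. It is not Prometheus or ZooKeeper code: ContentionSketch and LockedWindow are hypothetical names, and the sketch assumes insert() is guarded by a single lock, which is what the blocked CommitProcWorker threads suggest. Four worker threads record samples into one shared object, so only one of them can be inside insert() at any moment.

```java
public class ContentionSketch {

    // Hypothetical stand-in for a shared quantile window. The real estimator
    // does more work per sample; only the single-lock shape is modeled here.
    static final class LockedWindow {
        private long count;

        // Assumption: every caller must take the same monitor to record a sample.
        synchronized void insert(double value) {
            count++;
        }

        synchronized long count() {
            return count;
        }
    }

    // Starts `threads` workers that each call insert() `insertsPerThread` times
    // on one shared window, then returns the total number of recorded samples.
    static long run(int threads, int insertsPerThread) {
        LockedWindow window = new LockedWindow();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < insertsPerThread; i++) {
                    window.insert(i);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return window.count();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long total = run(4, 1_000_000); // 4 workers, mirroring the 4 CommitProcWorker threads
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("inserts=" + total + " elapsedMs=" + elapsedMs);
    }
}
```

Taking a thread dump while main() runs should show most workers in BLOCKED state on the LockedWindow monitor, which is the same shape I saw in the ZooKeeper dumps. Timings will vary by machine, so the sketch only illustrates the serialization, not the size of the slowdown.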