batching metrics reporting can help. For example, in the CommitProcessor, increasing the maxCommitBatchSize helps improving the the performance of write operation.
On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li4w...@gmail.com> wrote: > Yes, I am thinking that handling metrics reporting in a separate thread, > so it doesn't impact the "main" thread. > > Not sure about the idea of merging at the end of a reporting period. Can > you elaborate a bit on it? > > Thanks, > > Li > > On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <ted.dunn...@gmail.com> wrote: > >> Would it help to keep per thread metrics that are either reported >> independently or are merged at the end of a reporting period? >> >> >> >> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li4w...@gmail.com> wrote: >> >> > Hi Community, >> > >> > I've done further investigation on the issue and found the following >> > >> > 1. The perf of the read operation was decreased due to the lock >> contention >> > in the Prometheus TimeWindowQuantiles APIs. 3 out of 4 CommitProcWorker >> > threads were blocked on the TimeWindowQuantiles.insert() API when the >> test >> > was. >> > >> > 2. The perf of the write operation was decreased because of the high CPU >> > usage from Prometheus Summary type of metrics. The CPU usage of >> > CommitProcessor increased about 50% when Prometheus was disabled >> compared >> > to enabled (46% vs 80% with 4 CPU, 63% vs 99% with 12 CPU). >> > >> > >> > Prometheus integration is a great feature, however the negative >> performance >> > impact is very significant. I wonder if anyone has any thoughts on how >> to >> > reduce the perf impact. >> > >> > >> > >> > Thanks, >> > >> > >> > Li >> > >> > >> > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li4w...@gmail.com> wrote: >> > >> > > Hi, >> > > >> > > I would like to reach out to the community to see if anyone has some >> > > insights or experience with the performance impact of enabling >> prometheus >> > > metrics. >> > > >> > > I have done load comparison tests for Prometheus enabled vs disabled >> and >> > > found the performance is reduced about 40%-60% for both read and write >> > > oeprations (i.e. getData, getChildren and createNode). >> > > >> > > The load test was done with Zookeeper 3.7, cluster size of 5 >> participants >> > > and 5 observers, each ZK server has 10G heap size and 4 cpu, 500 >> > concurrent >> > > users sending requests. >> > > >> > > The performance impact is quite significant. I wonder if this is >> > expected >> > > and what are things we can do to have ZK performing the same while >> > > leveraging the new feature of Prometheus metric. >> > > >> > > Best, >> > > >> > > Li >> > > >> > > >> > > >> > > >> > >> >