Thanks for the input. We can discuss and look more into different ways of changing Zookeeper code to reduce the impact. It would be even better if the issue can be addressed by some tuning knobs or customizations in the Prometheus client library as mentioned by Enrico.
Li On Mon, Apr 26, 2021 at 9:35 PM Ted Dunning <ted.dunn...@gmail.com> wrote: > Batching metrics reporting is very similar to option (c) but with locking > like option (a). That can usually be made faster by passing a reference to > the metrics accumulator to the reporting thread which can do the batch > update without locks. Usually requires ping-pong metrics accumulators so > that a thread can accumulate in one accumulator for a bit, pass that to the > merge thread and switch to using the alternate accumulator. Since all > threads typically report at the same time, this avoids a stampede on the > global accumulator. > > > On Mon, Apr 26, 2021 at 9:30 PM Li Wang <li4w...@gmail.com> wrote: > > > batching metrics reporting can help. For example, in the CommitProcessor, > > increasing the maxCommitBatchSize helps improving the the performance of > > write operation. > > > > > > On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li4w...@gmail.com> wrote: > > > > > Yes, I am thinking that handling metrics reporting in a separate > thread, > > > so it doesn't impact the "main" thread. > > > > > > Not sure about the idea of merging at the end of a reporting period. > Can > > > you elaborate a bit on it? > > > > > > Thanks, > > > > > > Li > > > > > > On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <ted.dunn...@gmail.com> > > wrote: > > > > > >> Would it help to keep per thread metrics that are either reported > > >> independently or are merged at the end of a reporting period? > > >> > > >> > > >> > > >> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li4w...@gmail.com> wrote: > > >> > > >> > Hi Community, > > >> > > > >> > I've done further investigation on the issue and found the following > > >> > > > >> > 1. The perf of the read operation was decreased due to the lock > > >> contention > > >> > in the Prometheus TimeWindowQuantiles APIs. 3 out of 4 > > CommitProcWorker > > >> > threads were blocked on the TimeWindowQuantiles.insert() API when > the > > >> test > > >> > was. > > >> > > > >> > 2. The perf of the write operation was decreased because of the high > > CPU > > >> > usage from Prometheus Summary type of metrics. The CPU usage of > > >> > CommitProcessor increased about 50% when Prometheus was disabled > > >> compared > > >> > to enabled (46% vs 80% with 4 CPU, 63% vs 99% with 12 CPU). > > >> > > > >> > > > >> > Prometheus integration is a great feature, however the negative > > >> performance > > >> > impact is very significant. I wonder if anyone has any thoughts on > > how > > >> to > > >> > reduce the perf impact. > > >> > > > >> > > > >> > > > >> > Thanks, > > >> > > > >> > > > >> > Li > > >> > > > >> > > > >> > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li4w...@gmail.com> wrote: > > >> > > > >> > > Hi, > > >> > > > > >> > > I would like to reach out to the community to see if anyone has > some > > >> > > insights or experience with the performance impact of enabling > > >> prometheus > > >> > > metrics. > > >> > > > > >> > > I have done load comparison tests for Prometheus enabled vs > disabled > > >> and > > >> > > found the performance is reduced about 40%-60% for both read and > > write > > >> > > oeprations (i.e. getData, getChildren and createNode). > > >> > > > > >> > > The load test was done with Zookeeper 3.7, cluster size of 5 > > >> participants > > >> > > and 5 observers, each ZK server has 10G heap size and 4 cpu, 500 > > >> > concurrent > > >> > > users sending requests. > > >> > > > > >> > > The performance impact is quite significant. I wonder if this is > > >> > expected > > >> > > and what are things we can do to have ZK performing the same while > > >> > > leveraging the new feature of Prometheus metric. > > >> > > > > >> > > Best, > > >> > > > > >> > > Li > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > >> > > > > > >