Sorry for top posting. I have never seen Prometheus have such a significant impact in other applications.
The first things that come to my mind:

- collect a couple of dumps with some perf tool and dig into the problem
- verify that we have the latest version of the Prometheus client
- tune the few knobs we have in the Prometheus client

In Apache Pulsar and in Apache BookKeeper we have some customizations in the
Prometheus metrics collectors; we could take a look and port those to
ZooKeeper. (Initially I worked on the Prometheus integration based on the use
cases I have with Pulsar, BookKeeper and other systems that already use
Prometheus, but here in ZooKeeper we are using the basic Prometheus client.)

Enrico

On Tue, Apr 27, 2021 at 06:35 Ted Dunning <[email protected]> wrote:

> Batching metrics reporting is very similar to option (c) but with locking
> like option (a). That can usually be made faster by passing a reference to
> the metrics accumulator to the reporting thread, which can do the batch
> update without locks. This usually requires ping-pong metrics accumulators,
> so that a thread can accumulate in one accumulator for a bit, pass that to
> the merge thread, and switch to using the alternate accumulator. Since all
> threads typically report at the same time, this avoids a stampede on the
> global accumulator.
>
> On Mon, Apr 26, 2021 at 9:30 PM Li Wang <[email protected]> wrote:
>
> > Batching metrics reporting can help. For example, in the CommitProcessor,
> > increasing maxCommitBatchSize helps improve the performance of write
> > operations.
> >
> > On Mon, Apr 26, 2021 at 9:21 PM Li Wang <[email protected]> wrote:
> >
> > > Yes, I am thinking of handling metrics reporting in a separate thread,
> > > so it doesn't impact the "main" thread.
> > >
> > > Not sure about the idea of merging at the end of a reporting period.
> > > Can you elaborate a bit on it?
> > > Thanks,
> > >
> > > Li
> > >
> > > On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <[email protected]> wrote:
> > >
> > >> Would it help to keep per-thread metrics that are either reported
> > >> independently or are merged at the end of a reporting period?
> > >>
> > >> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <[email protected]> wrote:
> > >>
> > >> > Hi Community,
> > >> >
> > >> > I've done further investigation on the issue and found the following:
> > >> >
> > >> > 1. The perf of the read operation decreased due to lock contention in
> > >> > the Prometheus TimeWindowQuantiles APIs. 3 out of 4 CommitProcWorker
> > >> > threads were blocked on the TimeWindowQuantiles.insert() API while
> > >> > the test was running.
> > >> >
> > >> > 2. The perf of the write operation decreased because of the high CPU
> > >> > usage of the Prometheus Summary type of metrics. The CPU usage of the
> > >> > CommitProcessor increased about 50% when Prometheus was enabled
> > >> > compared to disabled (46% vs 80% with 4 CPUs, 63% vs 99% with 12
> > >> > CPUs).
> > >> >
> > >> > Prometheus integration is a great feature; however, the negative
> > >> > performance impact is very significant. I wonder if anyone has any
> > >> > thoughts on how to reduce the perf impact.
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Li
> > >> >
> > >> > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <[email protected]> wrote:
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > I would like to reach out to the community to see if anyone has
> > >> > > some insights or experience with the performance impact of enabling
> > >> > > Prometheus metrics.
> > >> > >
> > >> > > I have done load comparison tests with Prometheus enabled vs
> > >> > > disabled and found the performance is reduced about 40%-60% for
> > >> > > both read and write operations (i.e. getData, getChildren and
> > >> > > createNode).
> > >> > > The load test was done with ZooKeeper 3.7, a cluster of 5
> > >> > > participants and 5 observers, each ZK server with a 10G heap and
> > >> > > 4 CPUs, and 500 concurrent users sending requests.
> > >> > >
> > >> > > The performance impact is quite significant. I wonder if this is
> > >> > > expected and what we can do to have ZK perform the same while
> > >> > > leveraging the new Prometheus metrics feature.
> > >> > >
> > >> > > Best,
> > >> > >
> > >> > > Li
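Ted's ping-pong accumulator idea above could be sketched roughly as follows. This is a minimal, hypothetical illustration (the class and method names are made up, not ZooKeeper or Prometheus client APIs): each worker thread owns a pair of plain accumulators, updates the active one without locks, and the merge thread swaps them to claim a batch of updates.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of ping-pong metrics accumulators: the worker thread
// increments the "active" accumulator lock-free; the merge thread swaps the
// pair and drains the inactive one into the global totals. A production
// version would also need a quiescence step so the worker cannot be mid-update
// on the accumulator being drained.
public class PingPongAccumulator {
    // A trivial accumulator: a count and a sum, written by one thread only.
    static final class Acc {
        long count;
        long sum;
        void add(long value) { count++; sum += value; }
        void reset() { count = 0; sum = 0; }
    }

    private final Acc a = new Acc();
    private final Acc b = new Acc();
    // The worker reads this reference on every update; the merge thread
    // swaps it to claim the other accumulator.
    private final AtomicReference<Acc> active = new AtomicReference<>(a);

    // Called by the owning worker thread only: no locks on the hot path.
    public void record(long value) {
        active.get().add(value);
    }

    // Called by the merge thread: reset the alternate accumulator, make it
    // active, and return the filled one for merging into global totals.
    public Acc swapAndDrain() {
        Acc current = active.get();
        Acc other = (current == a) ? b : a;
        other.reset();
        active.set(other);
        return current;
    }
}
```

Since every worker swaps against its own pair, the only cross-thread traffic is the merge thread folding drained accumulators together once per reporting period, which is what avoids the stampede on a shared, locked structure.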
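The per-thread-then-merge suggestion can likewise be illustrated, for counter-style metrics, with the JDK's striped LongAdder (again a hypothetical sketch, not the ZooKeeper metrics provider): updates land in per-thread cells instead of contending on a single lock the way the worker threads contend on TimeWindowQuantiles.insert(), and the report path pays the cost of summing the cells once per scrape. Quantile metrics like the Summary discussed above would need the analogous treatment with per-thread sketches merged at scrape time.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of lock-free per-thread counters merged at report time.
// LongAdder stripes increments across internal cells to avoid contention;
// sum() merges the cells when the metric is actually scraped.
public class StripedCounters {
    private final ConcurrentHashMap<String, LongAdder> counters =
            new ConcurrentHashMap<>();

    // Hot path: contention-free increment from any worker thread.
    public void inc(String name, long delta) {
        counters.computeIfAbsent(name, k -> new LongAdder()).add(delta);
    }

    // Report path: merge the striped cells once per scrape.
    public long value(String name) {
        LongAdder adder = counters.get(name);
        return adder == null ? 0L : adder.sum();
    }
}
```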
