What I observed is that the impact varies depending on which types of
metrics (counter/gauge vs summary) are reported and how many metrics the
application reports. Quite a lot of metrics have been added to the
CommitProcessor, and they are pretty much all of the Summary type.

For the read operation, the number of metrics involved is not as large as
for the write operation. With a single CommitProcWorker thread, the
performance is pretty much the same whether Prometheus is enabled or
disabled. However, with multiple CommitProcWorker threads, the performance
impact is significant: essentially only one worker is running while the
others are blocked on Prometheus metrics reporting. The following is from
the thread profiling.

[image: image.png]
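
The contention is easy to reproduce outside of ZooKeeper. Below is a minimal
sketch, assuming the standard io.prometheus.client simpleclient (the metric
name, thread count and loop counts are made up for illustration): with
quantiles configured, every observe() call goes through the same shared
time-window quantile state, so adding worker threads barely helps.

    import io.prometheus.client.Summary;

    public class SummaryContentionDemo {
        public static void main(String[] args) throws InterruptedException {
            // A Summary with quantiles keeps shared, synchronized quantile
            // state (TimeWindowQuantiles) that every observe() has to enter.
            Summary latency = Summary.build()
                    .name("demo_read_latency")
                    .help("Illustrative read latency summary")
                    .quantile(0.5, 0.05)
                    .quantile(0.99, 0.001)
                    .create();  // create() avoids the default registry

            int workers = 4;  // e.g. the number of CommitProcWorker threads
            int totalObservations = 8_000_000;
            int perWorker = totalObservations / workers;

            Runnable work = () -> {
                for (int i = 0; i < perWorker; i++) {
                    // All workers funnel into the same quantile estimator.
                    latency.observe(i % 100);
                }
            };

            long start = System.nanoTime();
            Thread[] threads = new Thread[workers];
            for (int i = 0; i < workers; i++) {
                threads[i] = new Thread(work, "worker-" + i);
                threads[i].start();
            }
            for (Thread t : threads) {
                t.join();
            }
            System.out.printf("%d workers took %d ms%n",
                    workers, (System.nanoTime() - start) / 1_000_000);
            // With the same total number of observations, going from 1 worker
            // to 4 barely improves the elapsed time, which matches the thread
            // dumps showing most workers blocked in TimeWindowQuantiles.insert().
        }
    }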

It's great that we have the option of tuning some knobs in the Prometheus
client and also of porting some customizations into the metrics collector.
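
As a concrete (purely illustrative) example of that kind of tuning on the
simpleclient Summary builder: tracking fewer quantiles, using a shorter
sliding window and fewer age buckets all reduce the work done under the
shared lock. The values below are placeholders, not recommendations.

    import io.prometheus.client.Summary;

    public class TunedSummaryExample {
        // Track only the quantile we actually need, over a shorter window
        // with fewer rotating buckets (the client's defaults are larger).
        static final Summary TUNED_LATENCY = Summary.build()
                .name("demo_tuned_latency")
                .help("Summary with reduced quantile bookkeeping")
                .quantile(0.99, 0.01)
                .maxAgeSeconds(60)
                .ageBuckets(2)
                .register();
    }

Dropping the quantile() calls entirely turns the Summary into a plain
count/sum pair, which avoids the quantile lock altogether at the cost of
losing the percentile data.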

Please let me know if you have any questions or need any additional info.

Thanks,

Li


On Mon, Apr 26, 2021 at 10:42 PM Enrico Olivelli <eolive...@gmail.com>
wrote:

> Sorry for top posting.
>
> I have never seen Prometheus have such a significant impact in other
> applications.
>
> The first things that come to mind:
> - collect a couple of dumps with some perf tool and dig into the problem
> - verify that we have the latest version of the Prometheus client
> - tune the few knobs we have in the Prometheus client
>
> In Apache Pulsar and in Apache BookKeeper we have some customizations in
> the Prometheus metrics collectors; we could take a look and port those to
> ZooKeeper (initially I worked on the Prometheus integration based on the
> use cases I have with Pulsar, BookKeeper and other systems that already use
> Prometheus, but here in ZooKeeper we are using the basic Prometheus client).
>
> Enrico
>
> On Tue, Apr 27, 2021 at 06:35 Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > Batching metrics reporting is very similar to option (c) but with locking
> > like option (a). That can usually be made faster by passing a reference to
> > the metrics accumulator to the reporting thread, which can do the batch
> > update without locks. This usually requires ping-pong metrics accumulators
> > so that a thread can accumulate into one accumulator for a bit, pass that
> > to the merge thread, and switch to using the alternate accumulator. Since
> > all threads typically report at the same time, this avoids a stampede on
> > the global accumulator.
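
A rough sketch of that ping-pong scheme, in case it helps the discussion.
The class and method names are invented for illustration (this is not an
existing ZooKeeper or Prometheus API), and the hand-off and memory
visibility details a real implementation would need are glossed over.

    import java.util.concurrent.atomic.AtomicReference;

    // Per-worker recorder: the worker writes into its active accumulator
    // without taking any shared lock; a reporting thread periodically swaps
    // in a fresh accumulator and merges the old one into the global metric
    // as a single batch update.
    final class PingPongRecorder {

        static final class Accumulator {
            long count;
            double sum;

            void record(double value) {
                count++;
                sum += value;
            }
        }

        private final AtomicReference<Accumulator> active =
                new AtomicReference<>(new Accumulator());

        // Hot path, called only by the owning worker thread.
        void record(double value) {
            active.get().record(value);
        }

        // Called by the reporting thread once per reporting period. A real
        // implementation would also need to ensure the worker has finished
        // with the old accumulator before it is read.
        Accumulator swapForReporting() {
            return active.getAndSet(new Accumulator());
        }
    }

The reporting thread would walk over every worker's recorder, call
swapForReporting(), and fold the returned count/sum into the shared metric
in one batched update, so the worker threads never block on that metric's
lock.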
> >
> >
> > On Mon, Apr 26, 2021 at 9:30 PM Li Wang <li4w...@gmail.com> wrote:
> >
> > > Batching metrics reporting can help. For example, in the CommitProcessor,
> > > increasing the maxCommitBatchSize helps improve the performance of the
> > > write operation.
> > >
> > >
> > > On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li4w...@gmail.com> wrote:
> > >
> > > > Yes, I am thinking of handling metrics reporting in a separate thread,
> > > > so it doesn't impact the "main" thread.
> > > >
> > > > Not sure about the idea of merging at the end of a reporting period. Can
> > > > you elaborate a bit on it?
> > > >
> > > > Thanks,
> > > >
> > > > Li
> > > >
> > > > On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <ted.dunn...@gmail.com>
> > > wrote:
> > > >
> > > >> Would it help to keep per thread metrics that are either reported
> > > >> independently or are merged at the end of a reporting period?
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li4w...@gmail.com> wrote:
> > > >>
> > > >> > Hi Community,
> > > >> >
> > > >> > I've done further investigation on the issue and found the following:
> > > >> >
> > > >> > 1. The perf of the read operation was decreased due to the lock
> > > >> > contention in the Prometheus TimeWindowQuantiles APIs. 3 out of 4
> > > >> > CommitProcWorker threads were blocked on the TimeWindowQuantiles.insert()
> > > >> > API while the test was running.
> > > >> >
> > > >> > 2. The perf of the write operation was decreased because of the high
> > > >> > CPU usage from the Prometheus Summary type of metrics. The CPU usage
> > > >> > of the CommitProcessor increased by about 50% when Prometheus was
> > > >> > enabled compared to disabled (46% vs 80% with 4 CPUs, 63% vs 99% with
> > > >> > 12 CPUs).
> > > >> >
> > > >> >
> > > >> > Prometheus integration is a great feature; however, the negative
> > > >> > performance impact is very significant. I wonder if anyone has any
> > > >> > thoughts on how to reduce the perf impact.
> > > >> >
> > > >> >
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> >
> > > >> > Li
> > > >> >
> > > >> >
> > > >> > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li4w...@gmail.com>
> wrote:
> > > >> >
> > > >> > > Hi,
> > > >> > >
> > > >> > > I would like to reach out to the community to see if anyone has
> > > >> > > some insights or experience with the performance impact of enabling
> > > >> > > Prometheus metrics.
> > > >> > >
> > > >> > > I have done load comparison tests with Prometheus enabled vs
> > > >> > > disabled and found the performance is reduced by about 40%-60% for
> > > >> > > both read and write operations (i.e. getData, getChildren and
> > > >> > > createNode).
> > > >> > >
> > > >> > > The load test was done with ZooKeeper 3.7 and a cluster of 5
> > > >> > > participants and 5 observers, with each ZK server having a 10G heap
> > > >> > > and 4 CPUs, and 500 concurrent users sending requests.
> > > >> > >
> > > >> > > The performance impact is quite significant. I wonder if this is
> > > >> > > expected and what we can do to have ZK perform the same while
> > > >> > > leveraging the new Prometheus metrics feature.
> > > >> > >
> > > >> > > Best,
> > > >> > >
> > > >> > > Li
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>
