Hi Enrico,

Please see my comments inline.

Thanks,

 Li.

On Mon, Apr 26, 2021 at 10:42 PM Enrico Olivelli <eolive...@gmail.com>
wrote:

> Sorry for top posting.
>
> I have never seen Prometheus have such a significant impact in other
> applications.


Please see my reply in the previous response.


>
> The first things that come into my mind:
> - collect a couple of dumps with some perf tool and dig into the problem


Heap dumps, thread dumps and CPU profiles with Prometheus enabled vs
disabled have been collected. That's how we found the issue.


> -  verify that we have the latest version of Prometheus client
>

ZK uses 0.9.0 and the latest version of the Prometheus client is 0.10.0.
Checking the change logs
(https://github.com/prometheus/client_java/releases), it doesn't look like
there are major changes between these two versions. We can upgrade and see
if it makes any difference.

> - tune the few knobs we have in the Prometheus client
>

What are the knobs we can tune? Can you provide more info on this?

>
> In Apache Pulsar and in Apache BookKeeper we have some customizations in
> the Prometheus metrics collectors; we could take a look and port those to
> ZooKeeper (initially I worked on the Prometheus integration based on the
> use cases I have with Pulsar, BookKeeper and other systems that already use
> Prometheus, but here in ZooKeeper we are using the basic Prometheus client)
>

Are there any customizations for minimizing the performance impact? Is
there anything we can port to ZK to help with the issue?


> Enrico
>
> Il Mar 27 Apr 2021, 06:35 Ted Dunning <ted.dunn...@gmail.com> ha scritto:
>
> > Batching metrics reporting is very similar to option (c) but with locking
> > like option (a). That can usually be made faster by passing a reference to
> > the metrics accumulator to the reporting thread, which can do the batch
> > update without locks. This usually requires ping-pong metrics accumulators
> > so that a thread can accumulate in one accumulator for a bit, pass that to
> > the merge thread, and switch to the alternate accumulator. Since all
> > threads typically report at the same time, this avoids a stampede on the
> > global accumulator.
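For what it's worth, the ping-pong scheme could look roughly like this. This is a minimal sketch with hypothetical names (not ZooKeeper or Prometheus code), and it glosses over the handoff race noted in the comments:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicReference;

// All class and method names here are hypothetical, for illustration only.
class Accumulator {
    long count;
    long sum;

    // Called only by the owning worker thread; no locking needed.
    void record(long v) { count++; sum += v; }

    void clear() { count = 0; sum = 0; }
}

class ThreadMetrics {
    private final Accumulator a = new Accumulator();
    private final Accumulator b = new Accumulator();
    private final AtomicReference<Accumulator> active = new AtomicReference<>(a);

    // Hot path: one volatile read plus plain field updates, no lock.
    void record(long value) { active.get().record(value); }

    // Called by the reporter: flip to the other buffer and hand back the old
    // one. NB: a real implementation needs a handoff protocol here, because
    // the owner may still be mid-record in the buffer being returned.
    Accumulator swap() {
        Accumulator current = active.get();
        active.set(current == a ? b : a);
        return current;
    }
}

class Reporter {
    private final List<ThreadMetrics> perThread = new CopyOnWriteArrayList<>();
    long globalCount;
    long globalSum;

    ThreadMetrics register() {
        ThreadMetrics m = new ThreadMetrics();
        perThread.add(m);
        return m;
    }

    // Runs once per reporting period; only this thread touches the global
    // totals, so there is no stampede on the global accumulator.
    void collect() {
        for (ThreadMetrics m : perThread) {
            Accumulator idle = m.swap();
            globalCount += idle.count;
            globalSum += idle.sum;
            idle.clear();
        }
    }
}
```

The worker's hot path never blocks; contention is confined to the reporter's periodic collect().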
> >
> >
> > On Mon, Apr 26, 2021 at 9:30 PM Li Wang <li4w...@gmail.com> wrote:
> >
> > > Batching metrics reporting can help. For example, in the
> > > CommitProcessor, increasing the maxCommitBatchSize helps improve the
> > > performance of write operations.
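To make the batching point concrete, here is a minimal sketch (hypothetical names, not the actual Prometheus client or ZooKeeper code): the worker buffers observations locally and takes the shared lock once per batch instead of once per observation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shared sink guarded by a lock, standing in for something
// like a summary's time window; not real Prometheus code.
class SharedSummary {
    private final List<Long> values = new ArrayList<>();
    int lockAcquisitions = 0;  // instrumentation for this sketch only

    // One synchronized call per batch instead of one per observation.
    synchronized void insertAll(long[] batch, int n) {
        lockAcquisitions++;
        for (int i = 0; i < n; i++) {
            values.add(batch[i]);
        }
    }

    synchronized int size() { return values.size(); }
}

class BatchedRecorder {
    private final long[] buffer = new long[64];
    private int n = 0;
    private final SharedSummary shared;

    BatchedRecorder(SharedSummary shared) { this.shared = shared; }

    // Hot path: thread-local writes; no lock until the buffer fills up.
    void record(long value) {
        buffer[n++] = value;
        if (n == buffer.length) flush();
    }

    // Drain the local buffer under a single lock acquisition.
    void flush() {
        if (n > 0) {
            shared.insertAll(buffer, n);
            n = 0;
        }
    }
}
```

With a batch size of 64, the shared lock is taken roughly 64x less often on the hot path, at the cost of metrics lagging by up to one unflushed batch.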
> > >
> > >
> > > On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li4w...@gmail.com> wrote:
> > >
> > > > Yes, I am thinking of handling metrics reporting in a separate
> > > > thread, so it doesn't impact the "main" thread.
> > > >
> > > > Not sure about the idea of merging at the end of a reporting period.
> > > > Can you elaborate a bit on it?
> > > >
> > > > Thanks,
> > > >
> > > > Li
> > > >
> > > > On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <ted.dunn...@gmail.com>
> > > wrote:
> > > >
> > > >> Would it help to keep per-thread metrics that are either reported
> > > >> independently or merged at the end of a reporting period?
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li4w...@gmail.com> wrote:
> > > >>
> > > >> > Hi Community,
> > > >> >
> > > >> > I've done further investigation on the issue and found the
> > > >> > following:
> > > >> >
> > > >> > 1. The perf of the read operation was decreased due to the lock
> > > >> > contention in the Prometheus TimeWindowQuantiles APIs. 3 out of 4
> > > >> > CommitProcWorker threads were blocked on the
> > > >> > TimeWindowQuantiles.insert() API while the test was running.
> > > >> >
> > > >> > 2. The perf of the write operation was decreased because of the
> > > >> > high CPU usage from the Prometheus Summary type of metrics. The
> > > >> > CPU usage of CommitProcessor increased about 50% when Prometheus
> > > >> > was enabled compared to disabled (46% vs 80% with 4 CPUs, 63% vs
> > > >> > 99% with 12 CPUs).
> > > >> >
> > > >> >
> > > >> > Prometheus integration is a great feature; however, the negative
> > > >> > performance impact is very significant. I wonder if anyone has any
> > > >> > thoughts on how to reduce it.
> > > >> >
> > > >> >
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> >
> > > >> > Li
> > > >> >
> > > >> >
> > > >> > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li4w...@gmail.com>
> wrote:
> > > >> >
> > > >> > > Hi,
> > > >> > >
> > > >> > > I would like to reach out to the community to see if anyone has
> > > >> > > insights or experience with the performance impact of enabling
> > > >> > > Prometheus metrics.
> > > >> > >
> > > >> > > I have done load comparison tests with Prometheus enabled vs
> > > >> > > disabled and found the performance is reduced by about 40%-60%
> > > >> > > for both read and write operations (i.e. getData, getChildren
> > > >> > > and createNode).
> > > >> > >
> > > >> > > The load test was done with ZooKeeper 3.7, a cluster of 5
> > > >> > > participants and 5 observers; each ZK server has a 10G heap and
> > > >> > > 4 CPUs, with 500 concurrent users sending requests.
> > > >> > >
> > > >> > > The performance impact is quite significant. I wonder if this
> > > >> > > is expected and what we can do to have ZK perform the same while
> > > >> > > leveraging the new Prometheus metrics feature.
> > > >> > >
> > > >> > > Best,
> > > >> > >
> > > >> > > Li
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>
