Sorry for top posting.

I have never seen Prometheus have such a significant impact in other
applications.

The first things that come to mind:
- collect a couple of dumps with some perf tool and dig into the problem
- verify that we have the latest version of the Prometheus client
- tune the few knobs we have in the Prometheus client

In Apache Pulsar and in Apache BookKeeper we have some customizations in
the Prometheus metrics collectors. We could take a look and port those to
ZooKeeper. (Initially I worked on the Prometheus integration based on my
use cases with Pulsar, BookKeeper and other systems that already use
Prometheus, but here in ZooKeeper we are using the basic Prometheus client.)

Enrico

On Tue, Apr 27, 2021, 06:35 Ted Dunning <[email protected]> wrote:

> Batching metrics reporting is very similar to option (c) but with locking
> like option (a). That can usually be made faster by passing a reference to
> the metrics accumulator to the reporting thread which can do the batch
> update without locks. Usually requires ping-pong metrics accumulators so
> that a thread can accumulate in one accumulator for a bit, pass that to the
> merge thread and switch to using the alternate accumulator. Since all
> threads typically report at the same time, this avoids a stampede on the
> global accumulator.
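
A minimal sketch of the ping-pong hand-off described above, in plain Java.
Everything here is hypothetical (PingPongMetrics, Worker, flip, drain are
made-up names, not ZooKeeper or Prometheus APIs), and it only tracks a count
and a sum rather than quantiles. Each worker owns two unsynchronized
accumulators and hands the filled one to the merge thread through a
lock-free queue, which also provides the memory visibility.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

/** Illustrative sketch only; none of these names exist in ZooKeeper or Prometheus. */
final class PingPongMetrics {

    /** Plain counters with no locking: exactly one thread touches an instance at a time. */
    static final class Accumulator {
        final ConcurrentLinkedQueue<Accumulator> home; // the owning worker's spare slot
        long count;
        long sum;
        Accumulator(ConcurrentLinkedQueue<Accumulator> home) { this.home = home; }
    }

    /** Per-thread handle owning two accumulators: one active, one spare. */
    final class Worker {
        private final ConcurrentLinkedQueue<Accumulator> spare = new ConcurrentLinkedQueue<>();
        private Accumulator active = new Accumulator(spare);

        Worker() { spare.add(new Accumulator(spare)); }

        /** Hot path: plain field updates, no lock and no atomic instruction. */
        void record(long value) { active.count++; active.sum += value; }

        /** End of a reporting period: hand over the filled accumulator, switch to the spare. */
        boolean flip() {
            Accumulator next = spare.poll();
            if (next == null) return false; // spare not merged yet; keep accumulating
            filled.add(active);             // the queue hand-off publishes the plain fields
            active = next;
            return true;
        }
    }

    private final ConcurrentLinkedQueue<Accumulator> filled = new ConcurrentLinkedQueue<>();
    private long totalCount; // written only by the thread that calls drain()
    private long totalSum;

    /** Merge thread: fold each handed-over accumulator into the totals, then return it. */
    void drain() {
        for (Accumulator a; (a = filled.poll()) != null; ) {
            totalCount += a.count;
            totalSum += a.sum;
            a.count = 0;
            a.sum = 0;
            a.home.add(a);
        }
    }

    /** Runs workers that each record 1..perThread, and returns {count, sum}. */
    static long[] runDemo(int threads, int perThread) {
        PingPongMetrics metrics = new PingPongMetrics();
        AtomicBoolean done = new AtomicBoolean(false);
        Thread merger = new Thread(() -> {
            while (!done.get()) { metrics.drain(); Thread.yield(); }
        });
        merger.start();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Worker w = metrics.new Worker();
            Thread th = new Thread(() -> {
                for (int i = 1; i <= perThread; i++) {
                    w.record(i);
                    if (i % 100 == 0) w.flip(); // pretend a period ends every 100 samples
                }
                while (!w.flip()) Thread.yield(); // final hand-off
            });
            workers.add(th);
            th.start();
        }
        try {
            for (Thread th : workers) th.join();
            done.set(true);
            merger.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        metrics.drain(); // pick up anything handed over after the merger's last pass
        return new long[] { metrics.totalCount, metrics.totalSum };
    }

    public static void main(String[] args) {
        long[] totals = runDemo(4, 1000);
        System.out.println("count=" + totals[0] + " sum=" + totals[1]);
    }
}
```

If the spare has not come back yet, flip() simply keeps accumulating in the
active instance, so a slow merge thread delays reporting but never blocks a
worker.
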
>
>
> On Mon, Apr 26, 2021 at 9:30 PM Li Wang <[email protected]> wrote:
>
> > Batching metrics reporting can help. For example, in the CommitProcessor,
> > increasing the maxCommitBatchSize helps improve the performance of
> > write operations.
> >
> >
> > On Mon, Apr 26, 2021 at 9:21 PM Li Wang <[email protected]> wrote:
> >
> > > Yes, I am thinking of handling metrics reporting in a separate thread,
> > > so it doesn't impact the "main" thread.
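
One possible shape for the separate reporting thread, as a rough sketch in
plain Java. All names are hypothetical (AsyncMetricsReporter, submit, stop),
and the loop body that adds up samples is only a stand-in for the costly
Summary update. The request thread does a non-blocking offer into a bounded
queue, and a single reporter thread drains it in batches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: request threads enqueue samples, one reporter thread consumes them. */
final class AsyncMetricsReporter {
    private final BlockingQueue<Long> samples;
    private final Thread reporter = new Thread(this::reportLoop, "metrics-reporter");
    private volatile boolean running = true;
    private long processed; // touched only by the reporter thread until stop() joins it
    private long total;

    AsyncMetricsReporter(int capacity) {
        samples = new ArrayBlockingQueue<>(capacity);
        reporter.setDaemon(true);
    }

    void start() { reporter.start(); }

    /** Hot path: never blocks; returns false (sample dropped) if the queue is full. */
    boolean submit(long latencyMicros) { return samples.offer(latencyMicros); }

    private void reportLoop() {
        List<Long> batch = new ArrayList<>();
        try {
            // Keep going after stop() is requested until the queue is empty.
            while (running || !samples.isEmpty()) {
                Long first = samples.poll(10, TimeUnit.MILLISECONDS);
                if (first == null) continue;
                batch.add(first);
                samples.drainTo(batch); // grab whatever else is already queued
                for (long v : batch) {  // stand-in for the expensive metric update
                    processed++;
                    total += v;
                }
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Call after producers are done: waits for the drain, returns {processed, total}. */
    long[] stop() {
        running = false;
        try {
            reporter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] { processed, total };
    }

    public static void main(String[] args) {
        AsyncMetricsReporter r = new AsyncMetricsReporter(1024);
        r.start();
        for (int i = 1; i <= 500; i++) r.submit(i);
        long[] out = r.stop();
        System.out.println("processed=" + out[0] + " total=" + out[1]);
    }
}
```

Note that the bounded queue is still a shared structure, so this moves the
expensive work off the request threads but does not remove every point of
contention; dropping samples when the queue is full is a deliberate
trade-off in this sketch.
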
> > >
> > > Not sure about the idea of merging at the end of a reporting period.
> > > Can you elaborate a bit on it?
> > >
> > > Thanks,
> > >
> > > Li
> > >
> > > On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <[email protected]> wrote:
> > >
> > >> Would it help to keep per thread metrics that are either reported
> > >> independently or are merged at the end of a reporting period?
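
For simple counts and sums, the JDK already offers something close to
"per-thread values merged at reporting time": java.util.concurrent.atomic.LongAdder
spreads updates over internal cells and only combines them in sum(). A small
sketch follows; the class and method names (StripedRequestStats, record,
report) are made up for illustration, and this does not cover quantiles,
which would need a mergeable data structure.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch: contention-friendly counters merged once per reporting period. */
final class StripedRequestStats {
    private final LongAdder count = new LongAdder();
    private final LongAdder totalMicros = new LongAdder();
    private long reportedCount;  // reporter-thread state: totals at the previous report
    private long reportedMicros;

    /** Hot path: each update lands in a cell that is rarely shared with other threads. */
    void record(long micros) {
        count.increment();
        totalMicros.add(micros);
    }

    /** Reporter thread, end of a period: merge the cells and return the delta since last time. */
    long[] report() {
        long c = count.sum();
        long m = totalMicros.sum();
        long[] delta = { c - reportedCount, m - reportedMicros };
        reportedCount = c;
        reportedMicros = m;
        return delta;
    }

    /** Each thread records perThread samples of 2 microseconds; returns the first report. */
    static long[] runDemo(int threads, int perThread) {
        StripedRequestStats stats = new StripedRequestStats();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread th = new Thread(() -> {
                for (int i = 0; i < perThread; i++) stats.record(2);
            });
            workers.add(th);
            th.start();
        }
        try {
            for (Thread th : workers) th.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return stats.report();
    }

    public static void main(String[] args) {
        long[] r = runDemo(4, 10_000);
        System.out.println("count=" + r[0] + " totalMicros=" + r[1]);
    }
}
```

The sketch reports deltas from sum() rather than calling sumThenReset(),
because the latter is documented as reliable only when no updates run
concurrently with it.
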
> > >>
> > >>
> > >>
> > >> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <[email protected]> wrote:
> > >>
> > >> > Hi Community,
> > >> >
> > >> > I've done further investigation on the issue and found the following:
> > >> >
> > >> > 1. The performance of read operations decreased due to lock
> > >> > contention in the Prometheus TimeWindowQuantiles APIs. 3 out of 4
> > >> > CommitProcWorker threads were blocked on the
> > >> > TimeWindowQuantiles.insert() API while the test was running.
> > >> >
> > >> > 2. The performance of write operations decreased because of the high
> > >> > CPU usage from Prometheus Summary type of metrics. The CPU usage of
> > >> > CommitProcessor was about 50% higher with Prometheus enabled than
> > >> > with it disabled (80% vs 46% with 4 CPUs, 99% vs 63% with 12 CPUs).
> > >> >
> > >> >
> > >> > Prometheus integration is a great feature; however, the negative
> > >> > performance impact is very significant. I wonder if anyone has any
> > >> > thoughts on how to reduce the perf impact.
> > >> >
> > >> >
> > >> >
> > >> > Thanks,
> > >> >
> > >> >
> > >> > Li
> > >> >
> > >> >
> > >> > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <[email protected]> wrote:
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > I would like to reach out to the community to see if anyone has
> > >> > > some insights or experience with the performance impact of
> > >> > > enabling Prometheus metrics.
> > >> > >
> > >> > > I have done load comparison tests for Prometheus enabled vs
> > >> > > disabled and found that performance is reduced by about 40%-60%
> > >> > > for both read and write operations (i.e. getData, getChildren and
> > >> > > createNode).
> > >> > >
> > >> > > The load test was done with ZooKeeper 3.7 and a cluster of 5
> > >> > > participants and 5 observers. Each ZK server has a 10G heap and
> > >> > > 4 CPUs, with 500 concurrent users sending requests.
> > >> > >
> > >> > > The performance impact is quite significant. I wonder if this is
> > >> > > expected and what we can do to have ZK perform the same while
> > >> > > leveraging the new Prometheus metrics feature.
> > >> > >
> > >> > > Best,
> > >> > >
> > >> > > Li
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> >
> > >>
> > >
> >
>
