Batching metrics reporting is very similar to option (c), but with locking
like option (a). It can usually be made faster by passing a reference to the
metrics accumulator to the reporting thread, which can then do the batch
update without locks. This usually requires ping-pong accumulators: a thread
accumulates into one accumulator for a while, hands it to the merge thread,
and switches to the alternate accumulator. Since all threads typically report
at the same time, this avoids a stampede on the global accumulator.
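
A rough sketch of the ping-pong pattern (all names here are hypothetical and
for illustration only, not ZooKeeper or Prometheus APIs; a single reporter
thread is assumed):

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.atomic.AtomicLong;

    // Each worker thread owns one PingPongCounter and writes only to it, so
    // the hot path never contends with other writers. The reporter thread
    // periodically flips each counter and drains the buffer writers just left.
    public class PingPongCounter {
        private final AtomicLong[] buffers = {new AtomicLong(), new AtomicLong()};
        private volatile int active = 0;

        // Registry so the reporter can find every per-thread counter.
        private static final List<PingPongCounter> ALL = new CopyOnWriteArrayList<>();

        public PingPongCounter() {
            ALL.add(this);
        }

        // Hot path: a volatile read plus one atomic add, no locks.
        public void add(long delta) {
            buffers[active].addAndGet(delta);
        }

        // Reporter only: flip, then drain the now-idle buffer. A writer that
        // races the flip lands in one buffer or the other, so its update is
        // counted exactly once, though possibly a period or two late.
        private long drain() {
            int old = active;
            active = 1 - old;
            return buffers[old].getAndSet(0);
        }

        // One merge per reporting period; only the reporter touches the
        // global total, so the threads never stampede a shared accumulator.
        public static long mergeAll() {
            long total = 0;
            for (PingPongCounter c : ALL) {
                total += c.drain();
            }
            return total;
        }
    }

Each worker constructs its own counter and calls add() on the hot path; the
reporter calls mergeAll() once per period and pushes the total into the
global metric.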


On Mon, Apr 26, 2021 at 9:30 PM Li Wang <li4w...@gmail.com> wrote:

> Batching metrics reporting can help. For example, in the CommitProcessor,
> increasing maxCommitBatchSize helps improve the performance of write
> operations.
>
>
> On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li4w...@gmail.com> wrote:
>
> > Yes, I am thinking of handling metrics reporting in a separate thread
> > so that it doesn't impact the "main" thread.
> >
> > Not sure about the idea of merging at the end of a reporting period. Can
> > you elaborate a bit on it?
> >
> > Thanks,
> >
> > Li
> >
> > On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <ted.dunn...@gmail.com>
> > wrote:
> >
> >> Would it help to keep per-thread metrics that are either reported
> >> independently or are merged at the end of a reporting period?
> >>
> >>
> >>
> >> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li4w...@gmail.com> wrote:
> >>
> >> > Hi Community,
> >> >
> >> > I've done further investigation on the issue and found the following:
> >> >
> >> > 1. The perf of the read operations decreased due to lock contention in
> >> > the Prometheus TimeWindowQuantiles APIs. 3 out of 4 CommitProcWorker
> >> > threads were blocked on the TimeWindowQuantiles.insert() API while the
> >> > test was running.
> >> >
> >> > 2. The perf of the write operations decreased because of the high CPU
> >> > usage of the Prometheus Summary metric type. The CPU usage of the
> >> > CommitProcessor increased by about 50% with Prometheus enabled compared
> >> > to disabled (46% vs 80% with 4 CPUs, 63% vs 99% with 12 CPUs).
> >> >
> >> >
> >> > Prometheus integration is a great feature; however, the negative
> >> > performance impact is very significant. I wonder if anyone has any
> >> > thoughts on how to reduce the perf impact.
> >> >
> >> >
> >> >
> >> > Thanks,
> >> >
> >> >
> >> > Li
> >> >
> >> >
> >> > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li4w...@gmail.com> wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > I would like to reach out to the community to see if anyone has
> >> > > insights or experience with the performance impact of enabling
> >> > > Prometheus metrics.
> >> > >
> >> > > I have done load comparison tests with Prometheus enabled vs disabled
> >> > > and found that performance is reduced by about 40%-60% for both read
> >> > > and write operations (i.e. getData, getChildren and createNode).
> >> > >
> >> > > The load test was done with ZooKeeper 3.7 and a cluster of 5
> >> > > participants and 5 observers; each ZK server had a 10G heap and 4
> >> > > CPUs, with 500 concurrent users sending requests.
> >> > >
> >> > > The performance impact is quite significant. I wonder if this is
> >> > > expected, and what we can do to have ZK perform the same while
> >> > > leveraging the new Prometheus metrics feature.
> >> > >
> >> > > Best,
> >> > >
> >> > > Li
> >> > >
> >> > >
> >> > >
> >> > >
> >> >
> >>
> >
>
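
For reference on the maxCommitBatchSize suggestion above: in recent ZooKeeper
versions it is read by the CommitProcessor as a JVM system property, so it can
be raised via the server's JVM flags. A sketch, assuming the stock property
name; verify the name and pick a value by measurement for your version:

    # e.g. in conf/zookeeper-env.sh, or however the server's JVMFLAGS are set
    export JVMFLAGS="$JVMFLAGS -Dzookeeper.maxCommitBatchSize=64"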
