batching metrics reporting can help. For example, in the CommitProcessor,
increasing the maxCommitBatchSize helps improve the performance of
write operations.
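
To make the batching idea concrete, here is a minimal sketch, not a patch.
BatchedRecorder and its methods are hypothetical names, not part of ZooKeeper
or the Prometheus client, and the sink stands in for whatever contended call
ends up updating the shared metric. Each thread buffers observations locally
and only takes the shared lock once per batch instead of once per observation.

```java
import java.util.function.LongConsumer;

/**
 * Hypothetical helper: buffers observations per thread and hands them to a
 * shared sink in batches, so the shared lock is taken once per batch.
 */
final class BatchedRecorder {
    /** Per-thread buffer; only ever touched by its owning thread. */
    private static final class Buffer {
        final long[] values;
        int size;

        Buffer(int capacity) {
            values = new long[capacity];
        }
    }

    private final LongConsumer sink;               // assumed: the expensive shared metric update
    private final Object sinkLock = new Object();  // serializes flushes into the sink
    private final ThreadLocal<Buffer> local;

    BatchedRecorder(int batchSize, LongConsumer sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.sink = sink;
        this.local = ThreadLocal.withInitial(() -> new Buffer(batchSize));
    }

    /** Hot path: a plain array write; the lock is taken only when the buffer fills up. */
    void record(long value) {
        Buffer b = local.get();
        b.values[b.size++] = value;
        if (b.size == b.values.length) {
            flush(b);
        }
    }

    /** Pushes out whatever the calling thread has buffered so far. */
    void flushCurrentThread() {
        flush(local.get());
    }

    private void flush(Buffer b) {
        if (b.size == 0) {
            return;
        }
        synchronized (sinkLock) {
            for (int i = 0; i < b.size; i++) {
                sink.accept(b.values[i]);
            }
        }
        b.size = 0;
    }
}
```

The trade-off is that buffered values are invisible to a scrape until the
owning thread flushes, so a worker would also need to call
flushCurrentThread() periodically or when it goes idle. Whether this actually
helps would have to be confirmed with the same load test.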


On Mon, Apr 26, 2021 at 9:21 PM Li Wang <li4w...@gmail.com> wrote:

> Yes, I am thinking of handling metrics reporting in a separate thread,
> so it doesn't impact the "main" thread.
>
> Not sure about the idea of merging at the end of a reporting period. Can
> you elaborate a bit on it?
>
> Thanks,
>
> Li
>
> On Mon, Apr 26, 2021 at 9:11 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> Would it help to keep per-thread metrics that are either reported
>> independently or merged at the end of a reporting period?
>>
>>
>>
>> On Mon, Apr 26, 2021 at 8:51 PM Li Wang <li4w...@gmail.com> wrote:
>>
>> > Hi Community,
>> >
>> > I've done further investigation on the issue and found the following:
>> >
>> > 1. The perf of the read operation decreased due to lock contention in
>> > the Prometheus TimeWindowQuantiles APIs. 3 out of 4 CommitProcWorker
>> > threads were blocked on the TimeWindowQuantiles.insert() API while the
>> > test was running.
>> >
>> > 2. The perf of the write operation decreased because of the high CPU
>> > usage from the Prometheus Summary type of metrics. The CPU usage of
>> > CommitProcessor increased about 50% when Prometheus was enabled
>> > compared to disabled (46% disabled vs 80% enabled with 4 CPUs, 63% vs
>> > 99% with 12 CPUs).
>> >
>> >
>> > Prometheus integration is a great feature; however, the negative
>> > performance impact is very significant. I wonder if anyone has any
>> > thoughts on how to reduce the perf impact.
>> >
>> >
>> >
>> > Thanks,
>> >
>> >
>> > Li
>> >
>> >
>> > On Tue, Apr 6, 2021 at 12:33 PM Li Wang <li4w...@gmail.com> wrote:
>> >
>> > > Hi,
>> > >
>> > > I would like to reach out to the community to see if anyone has some
>> > > insights or experience with the performance impact of enabling
>> > > Prometheus metrics.
>> > >
>> > > I have done load comparison tests for Prometheus enabled vs disabled
>> > > and found the performance is reduced by about 40%-60% for both read
>> > > and write operations (i.e. getData, getChildren and createNode).
>> > >
>> > > The load test was done with ZooKeeper 3.7 and a cluster of 5
>> > > participants and 5 observers. Each ZK server has a 10G heap and 4
>> > > CPUs, with 500 concurrent users sending requests.
>> > >
>> > > The performance impact is quite significant. I wonder if this is
>> > > expected and what we can do to have ZK perform the same while
>> > > leveraging the new Prometheus metrics feature.
>> > >
>> > > Best,
>> > >
>> > > Li
>> > >
>> > >
>> > >
>> > >
>> >
>>
>
