The CPU usage of both server and client is normal (< 50%) during the test.
Based on the investigation, the server is too busy to keep up with the load,
even though its CPU usage is normal. The issue doesn't exist in 3.4.14. I
wonder why there is a significant write performance degradation from 3.4.14
to 3.6.2 and how we can address the issue.

Best,

Li

On Thu, Mar 11, 2021 at 11:25 AM Andor Molnar <an...@apache.org> wrote:

> What is the CPU usage of both server and client during the test?
>
> Looks like the server is dropping the clients because either the server or
> both are too busy to deal with the load.
> This log line is also concerning: "Too busy to snap, skipping"
>
> If that's the case I believe you'll have to profile the server process to
> figure out where the perf bottleneck is.
>
> Andor
>
>
> On 2021. Feb 22., at 5:31, Li Wang <li4w...@gmail.com> wrote:
>
> > Thanks, Patrick.
> >
> > Yes, we are using the same JVM version and GC configurations when
> > running the two tests. I have checked the GC metrics and also the heap
> > dump of 3.6; the GC pauses and the memory usage look okay.
> >
> > Best,
> >
> > Li
> >
> > On Sun, Feb 21, 2021 at 3:34 PM Patrick Hunt <ph...@apache.org> wrote:
> >
> >> On Sun, Feb 21, 2021 at 3:28 PM Li Wang <li4w...@gmail.com> wrote:
> >>
> >>> Hi Enrico, Sushant,
> >>>
> >>> I re-ran the perf test with the data consistency check feature
> >>> disabled (i.e. -Dzookeeper.digest.enabled=false), and the write
> >>> performance issue of 3.6 is still there.
> >>>
> >>> With everything else exactly the same, the throughput of 3.6 was only
> >>> half that of 3.4 and the max latency was more than 8 times higher.
> >>>
> >>> Any other points or thoughts?
> >>>
> >>
> >> In the past I've noticed a big impact of GC when doing certain
> >> performance measurements. I assume you are using the same JVM version
> >> and GC when running the two tests? Perhaps our memory footprint has
> >> expanded over time. You should rule out GC by running with GC logging
> >> turned on with both versions and compare the impact.
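> >> A minimal sketch of what I mean, assuming JDK 11+ (your stack traces
> >> mention java.base) and the stock zkServer.sh scripts; the log path and
> >> exact flags are illustrative, adjust for your environment:
> >>
> >>   # JDK 9+ unified GC logging (hypothetical path)
> >>   export SERVER_JVMFLAGS="-Xlog:gc*:file=/tmp/zk-gc.log:time,uptime"
> >>
> >>   # JDK 8 equivalent, if the 3.4 cluster runs on an older JVM
> >>   export SERVER_JVMFLAGS="-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/zk-gc.log"
> >>
> >> Then compare pause times and allocation rates between the two versions.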
> >>
> >> Regards,
> >>
> >> Patrick
> >>
> >>
> >>> Cheers,
> >>>
> >>> Li
> >>>
> >>>
> >>> On Sat, Feb 20, 2021 at 9:04 PM Li Wang <li4w...@gmail.com> wrote:
> >>>
> >>>> Thanks Sushant and Enrico!
> >>>>
> >>>> This is a really good point. According to the 3.6 documentation, the
> >>>> feature is disabled by default:
> >>>> https://zookeeper.apache.org/doc/r3.6.2/zookeeperAdmin.html#ch_administration.
> >>>> However, checking the code, the default is enabled.
> >>>>
> >>>> Let me set zookeeper.digest.enabled to false and see how the write
> >>>> operation performs.
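> >>>> Concretely, something like this, assuming the standard zkServer.sh
> >>>> setup (the property name comes from Sushant's pointer; everything
> >>>> else here is illustrative):
> >>>>
> >>>>   # pass the flag to the server JVM, then restart
> >>>>   export SERVER_JVMFLAGS="-Dzookeeper.digest.enabled=false"
> >>>>   bin/zkServer.sh restart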
> >>>>
> >>>> Best,
> >>>>
> >>>> Li
> >>>>
> >>>>
> >>>> On Fri, Feb 19, 2021 at 1:32 PM Sushant Mane <sushantma...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Li,
> >>>>>
> >>>>> On 3.6.2 the consistency checker (AdHash based) is enabled by default:
> >>>>> https://github.com/apache/zookeeper/blob/803c7f1a12f85978cb049af5e4ef23bd8b688715/zookeeper-server/src/main/java/org/apache/zookeeper/server/ZooKeeperServer.java#L136.
> >>>>> It is not present in ZK 3.4.14.
> >>>>>
> >>>>> This feature does have some impact on write performance.
> >>>>>
> >>>>> Thanks,
> >>>>> Sushant
> >>>>>
> >>>>>
> >>>>> On Fri, Feb 19, 2021 at 12:50 PM Enrico Olivelli <eolive...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Li,
> >>>>>> I wonder if we have some new throttling/back-pressure mechanism
> >>>>>> that is enabled by default.
> >>>>>>
> >>>>>> Does anyone have some pointers to the relevant implementations?
> >>>>>>
> >>>>>>
> >>>>>> Enrico
> >>>>>>
> >>>>>> On Fri, Feb 19, 2021 at 19:46 Li Wang <li4w...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> We switched to Netty on both the client side and the server side,
> >>>>>>> and the performance issue is still there.
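> >>>>>>> For anyone who wants to reproduce: the Netty stack is selected via
> >>>>>>> the standard system properties (a sketch; set them in the JVM
> >>>>>>> flags on each side):
> >>>>>>>
> >>>>>>>   # server side
> >>>>>>>   -Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
> >>>>>>>   # client side
> >>>>>>>   -Dzookeeper.clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty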
> >>>>>>> Does anyone have any insights into what could be the cause of the
> >>>>>>> higher latency?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Li
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Feb 15, 2021 at 2:17 PM Li Wang <li4w...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi Enrico,
> >>>>>>>>
> >>>>>>>> Thanks for the reply.
> >>>>>>>>
> >>>>>>>> 1. We are using the NIO-based stack, not the Netty-based one yet.
> >>>>>>>>
> >>>>>>>> 2. Yes, here are some metrics on the client side.
> >>>>>>>>
> >>>>>>>> 3.6: throughput: 7K, failures: 81215228, avg latency: 57ms, max
> >>>>>>>> latency: 31s
> >>>>>>>> 3.4: throughput: 15K, failures: 0, avg latency: 30ms, max
> >>>>>>>> latency: 1.6s
> >>>>>>>>
> >>>>>>>> 3. Yes, the JVM and zoo.cfg configs are exactly the same:
> >>>>>>>>
> >>>>>>>> 10G of heap
> >>>>>>>> 13G of memory
> >>>>>>>> 5 participants
> >>>>>>>> 5 observers
> >>>>>>>> Client session timeout: 3000ms
> >>>>>>>> Server min session timeout: 4000ms
> >>>>>>>>
> >>>>>>>> 4. Yes, there are two types of WARN logs and many "Expiring
> >>>>>>>> session" INFO logs:
> >>>>>>>>
> >>>>>>>> 2021-02-15 22:04:36,506 [myid:4] - WARN
> >>>>>>>> [NIOWorkerThread-7:NIOServerCnxn@365] - Unexpected exception
> >>>>>>>> EndOfStreamException: Unable to read additional data from client,
> >>>>>>>> it probably closed the socket: address = /100.108.63.116:43366,
> >>>>>>>> session = 0x400189fee9a000b
> >>>>>>>>   at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:164)
> >>>>>>>>   at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:327)
> >>>>>>>>   at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
> >>>>>>>>   at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
> >>>>>>>>   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> >>>>>>>>   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> >>>>>>>>   at java.base/java.lang.Thread.run(Thread.java:834)
> >>>>>>>>
> >>>>>>>> 2021-02-15 22:05:14,428 [myid:4] - WARN
> >>>>>>>> [SyncThread:4:SyncRequestProcessor@188] - Too busy to snap, skipping
> >>>>>>>>
> >>>>>>>> 2021-02-15 22:01:51,427 [myid:4] - INFO
> >>>>>>>> [SessionTracker:ZooKeeperServer@610] - Expiring session
> >>>>>>>> 0x400189fee9a001e, timeout of 4000ms exceeded
> >>>>>>>>
> >>>>>>>> 5. Yes, we upgraded both the client and the server to 3.6.
> >>>>>>>> Actually, the issue happened with both of these combinations:
> >>>>>>>>
> >>>>>>>> 3.4 client and 3.6 server
> >>>>>>>> 3.6 client and 3.6 server
> >>>>>>>>
> >>>>>>>> Please let me know if you need any additional info.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Li
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Mon, Feb 15, 2021 at 1:44 PM Li Wang <li4w...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Enrico,
> >>>>>>>>>
> >>>>>>>>> Thanks for the reply.
> >>>>>>>>>
> >>>>>>>>> 1. We are using the direct NIO-based stack, not the Netty-based
> >>>>>>>>> one yet.
> >>>>>>>>> 2. Yes, on the client side, here are the metrics
> >>>>>>>>>
> >>>>>>>>> 3.6:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Feb 15, 2021 at 10:44 AM Enrico Olivelli
> >>>>>>>>> <eolive...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> IIRC the main difference is the switch to Netty 4 and the use
> >>>>>>>>>> of more direct memory. Are you using the Netty-based stack?
> >>>>>>>>>>
> >>>>>>>>>> Apart from that macro difference there have been many, many
> >>>>>>>>>> changes since 3.4.
> >>>>>>>>>>
> >>>>>>>>>> Do you have some metrics to share?
> >>>>>>>>>> Are the JVM configuration and the zoo.cfg identical in the two
> >>>>>>>>>> tests?
> >>>>>>>>>>
> >>>>>>>>>> Do you see warnings in the server logs?
> >>>>>>>>>>
> >>>>>>>>>> Did you upgrade both the client and the server, or only the
> >>>>>>>>>> server?
> >>>>>>>>>>
> >>>>>>>>>> Enrico
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Feb 15, 2021 at 18:30 Li Wang <li4w...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> We want to upgrade from 3.4.14 to 3.6.2. During the
> >>>>>>>>>>> performance/load comparison test, we found that write
> >>>>>>>>>>> performance in 3.6 is significantly degraded compared to 3.4.
> >>>>>>>>>>> Under the same load, there was a huge number of SessionExpired
> >>>>>>>>>>> and ConnectionLoss errors in 3.6, while there were no such
> >>>>>>>>>>> errors in 3.4.
> >>>>>>>>>>>
> >>>>>>>>>>> The load test uses 500 concurrent users against a cluster of 5
> >>>>>>>>>>> participants and 5 observers. The min session timeout on the
> >>>>>>>>>>> server side is 4000ms.
> >>>>>>>>>>>
> >>>>>>>>>>> I wonder if anyone has seen the same issue and has any
> >>>>>>>>>>> insights on what could be the cause of the performance
> >>>>>>>>>>> degradation.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks
> >>>>>>>>>>>
> >>>>>>>>>>> Li
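> >>>>>>>>>>> A configuration along these lines would match that setup
> >>>>>>>>>>> (a sketch only; hostnames and server ids are made up, and the
> >>>>>>>>>>> only values taken from above are the session timeout and the
> >>>>>>>>>>> 5+5 participant/observer split):
> >>>>>>>>>>>
> >>>>>>>>>>>   tickTime=2000
> >>>>>>>>>>>   minSessionTimeout=4000
> >>>>>>>>>>>   # participants
> >>>>>>>>>>>   server.1=zk1:2888:3888
> >>>>>>>>>>>   # ... server.2 through server.5 are analogous ...
> >>>>>>>>>>>   # observers (each observer node also sets peerType=observer)
> >>>>>>>>>>>   server.6=zk6:2888:3888:observer
> >>>>>>>>>>>   # ... server.7 through server.10 are analogous ...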