Hi Enrico, Sushant,

I re-ran the perf test with the data consistency check feature disabled
(i.e. -Dzookeeper.digest.enabled=false), and the write performance issue
in 3.6 is still there.
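
For reference, here is a minimal sketch of how a boolean JVM system property
such as zookeeper.digest.enabled is typically read, which can help confirm
that the flag actually reached the JVM. The class name DigestFlagSketch and
the method isDigestEnabled are hypothetical, and the "true" default only
mirrors what this thread reports about the 3.6 code; it is not copied from
the ZooKeeper source.

```java
public class DigestFlagSketch {
    // Property name taken from the flag used in this thread.
    static final String DIGEST_PROPERTY = "zookeeper.digest.enabled";

    // Returns the effective setting. The "true" default is an assumption
    // based on the behaviour reported in this thread.
    static boolean isDigestEnabled() {
        return Boolean.parseBoolean(System.getProperty(DIGEST_PROPERTY, "true"));
    }

    public static void main(String[] args) {
        // Print the value this JVM actually sees.
        System.out.println(DIGEST_PROPERTY + " = " + isDigestEnabled());
    }
}
```

Launched with java -Dzookeeper.digest.enabled=false DigestFlagSketch.java
(Java 11 or later), this should report the flag as false; without the -D
option it falls back to the assumed default.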

With everything else exactly the same, the throughput of 3.6 was only half
that of 3.4, and the max latency was more than 8 times higher.

Any other points or thoughts?

Cheers,

Li

On Sat, Feb 20, 2021 at 9:04 PM Li Wang <li4w...@gmail.com> wrote:

> Thanks Sushant and Enrico!
>
> This is a really good point.  According to the 3.6 documentation, the
> feature is disabled by default.
> https://zookeeper.apache.org/doc/r3.6.2/zookeeperAdmin.html#ch_administration.
> However, the code shows that it is enabled by default.
>
> Let me set zookeeper.digest.enabled to false and see how the write
> operation performs.
>
> Best,
>
> Li
>
>
>
>
> On Fri, Feb 19, 2021 at 1:32 PM Sushant Mane <sushantma...@gmail.com>
> wrote:
>
>> Hi Li,
>>
>> On 3.6.2, the consistency checker (AdHash-based) is enabled by default:
>>
>> https://github.com/apache/zookeeper/blob/803c7f1a12f85978cb049af5e4ef23bd8b688715/zookeeper-server/src/main/java/org/apache/zookeeper/server/ZooKeeperServer.java#L136
>>
>> It is not present in ZK 3.4.14.
>>
>> This feature does have some impact on write performance.
>>
>> Thanks,
>> Sushant
>>
>>
>> On Fri, Feb 19, 2021 at 12:50 PM Enrico Olivelli <eolive...@gmail.com>
>> wrote:
>>
>> > Li,
>> > I wonder if we have some new throttling/back-pressure mechanism that is
>> > enabled by default.
>> >
>> > Does anyone have pointers to the relevant implementations?
>> >
>> >
>> > Enrico
>> >
>> > On Fri, Feb 19, 2021, 19:46 Li Wang <li4w...@gmail.com> wrote:
>> >
>> > > Hi,
>> > >
>> > > We switched to Netty on both the client side and the server side, and the
>> > > performance issue is still there. Does anyone have any insights into what
>> > > could be the cause of the higher latency?
>> > >
>> > > Thanks,
>> > >
>> > > Li
>> > >
>> > >
>> > >
>> > > On Mon, Feb 15, 2021 at 2:17 PM Li Wang <li4w...@gmail.com> wrote:
>> > >
>> > > > Hi Enrico,
>> > > >
>> > > >
>> > > > Thanks for the reply.
>> > > >
>> > > >
>> > > > 1. We are using the NIO-based stack, not the Netty-based one yet.
>> > > >
>> > > > 2. Yes, here are some metrics on the client side.
>> > > >
>> > > >
>> > > > 3.6: throughput: 7K, failure: 81215228, Avg Latency: 57ms, Max Latency: 31s
>> > > >
>> > > > 3.4: throughput: 15K, failure: 0, Avg Latency: 30ms, Max Latency: 1.6s
>> > > >
>> > > >
>> > > > 3. Yes, the JVM and zoo.cfg configs are exactly the same:
>> > > >
>> > > > 10G of Heap
>> > > >
>> > > > 13G of Memory
>> > > >
>> > > > 5 Participants
>> > > >
>> > > > 5 Observers
>> > > >
>> > > > Client session timeout: 3000ms
>> > > >
>> > > > Server min session timeout: 4000ms
>> > > >
>> > > >
>> > > >
>> > > > 4. Yes, there are two types of WARN logs and many “Expiring session”
>> > > > INFO logs:
>> > > >
>> > > >
>> > > > 2021-02-15 22:04:36,506 [myid:4] - WARN
>> > > > [NIOWorkerThread-7:NIOServerCnxn@365] - Unexpected exception
>> > > >
>> > > > EndOfStreamException: Unable to read additional data from client, it
>> > > > probably closed the socket: address = /100.108.63.116:43366, session =
>> > > > 0x400189fee9a000b
>> > > >
>> > > > at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:164)
>> > > > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:327)
>> > > > at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
>> > > > at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
>> > > > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>> > > > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>> > > > at java.base/java.lang.Thread.run(Thread.java:834)
>> > > >
>> > > >
>> > > > 2021-02-15 22:05:14,428 [myid:4] - WARN
>> > > > [SyncThread:4:SyncRequestProcessor@188] - Too busy to snap, skipping
>> > > >
>> > > >
>> > > > 2021-02-15 22:01:51,427 [myid:4] - INFO
>> > > > [SessionTracker:ZooKeeperServer@610] - Expiring session
>> > > > 0x400189fee9a001e, timeout of 4000ms exceeded
>> > > >
>> > > >
>> > > >
>> > > > 5. Yes, we upgraded both the client and the server to 3.6. Actually, the
>> > > > issue happened with both of these combinations:
>> > > >
>> > > >
>> > > > 3.4 client and 3.6 server
>> > > >
>> > > > 3.6 client and 3.6 server
>> > > >
>> > > > Please let me know if you need any additional info.
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Li
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > On Mon, Feb 15, 2021 at 1:44 PM Li Wang <li4w...@gmail.com> wrote:
>> > > >
>> > > >> Hi Enrico,
>> > > >>
>> > > >> Thanks for the reply.
>> > > >>
>> > > >> 1. We are using direct NIO based stack, not Netty based yet.
>> > > >> 2. Yes, on the client side, here are the metrics
>> > > >>
>> > > >> 3.6:
>> > > >>
>> > > >>
>> > > >>
>> > > >>
>> > > >> On Mon, Feb 15, 2021 at 10:44 AM Enrico Olivelli <eolive...@gmail.com>
>> > > >> wrote:
>> > > >>
>> > > >>> IIRC, the main difference is the switch to Netty 4 and the use of
>> > > >>> more direct memory. Are you using the Netty-based stack?
>> > > >>>
>> > > >>> Apart from that macro difference, there have been many, many changes
>> > > >>> since 3.4.
>> > > >>>
>> > > >>> Do you have some metrics to share?
>> > > >>> Are the JVM configurations and zoo.cfg configurations identical?
>> > > >>>
>> > > >>> Do you see warnings in the server logs?
>> > > >>>
>> > > >>> Did you upgrade both the client and the server or only the server?
>> > > >>>
>> > > >>> Enrico
>> > > >>>
>> > > >>>
>> > > >>> On Mon, Feb 15, 2021, 18:30 Li Wang <li4w...@gmail.com> wrote:
>> > > >>>
>> > > >>> > Hi,
>> > > >>> >
>> > > >>> > We want to upgrade from 3.4.14 to 3.6.2. During the performance/load
>> > > >>> > comparison test, we found that write performance in 3.6 has
>> > > >>> > degraded significantly compared to 3.4. Under the same load, there
>> > > >>> > was a huge number of SessionExpired and ConnectionLoss errors in
>> > > >>> > 3.6, while there were no such errors in 3.4.
>> > > >>> >
>> > > >>> > The load test uses 500 concurrent users with a cluster of 5
>> > > >>> > participants and 5 observers. The min session timeout on the server
>> > > >>> > side is 4000ms.
>> > > >>> >
>> > > >>> > I wonder if anyone has seen the same issue and has any insights into
>> > > >>> > what could be the cause of the performance degradation.
>> > > >>> >
>> > > >>> > Thanks
>> > > >>> >
>> > > >>> > Li
>> > > >>> >
>> > > >>>
>> > > >>
>> > >
>> >
>>
>
