The CPU usage of both server and client is normal (< 50%) during the test.
Based on the investigation, the server is too busy to keep up with the load,
even though its CPU usage is normal. The issue doesn't exist in 3.4.14. I
wonder why there is a significant write performance degradation from 3.4.14
to 3.6.2 and how we can address the issue.

Best,

Li

On Thu, Mar 11, 2021 at 11:25 AM Andor Molnar <an...@apache.org> wrote:

> What is the CPU usage of both server and client during the test?
>
> Looks like the server is dropping the clients because either the server or
> both are too busy to deal with the load.
> This log line is also concerning: "Too busy to snap, skipping"
>
> If that's the case I believe you'll have to profile the server process to
> figure out where the perf bottleneck is.
>
> Andor
>
>
> On 2021. Feb 22., at 5:31, Li Wang <li4w...@gmail.com> wrote:
>
> > Thanks, Patrick.
> >
> > Yes, we are using the same JVM version and GC configurations when
> > running the two tests. I have checked the GC metrics and also the heap
> > dump of 3.6; the GC pauses and the memory usage look okay.
> >
> > Best,
> >
> > Li
> >
> > On Sun, Feb 21, 2021 at 3:34 PM Patrick Hunt <ph...@apache.org> wrote:
> >
> >> On Sun, Feb 21, 2021 at 3:28 PM Li Wang <li4w...@gmail.com> wrote:
> >>
> >>> Hi Enrico, Sushant,
> >>>
> >>> I re-ran the perf test with the data consistency check feature
> >>> disabled (i.e. -Dzookeeper.digest.enabled=false), and the write
> >>> performance issue of 3.6 is still there.
> >>>
> >>> With everything else exactly the same, the throughput of 3.6 was only
> >>> half that of 3.4 and the max latency was more than 8 times higher.
> >>>
> >>> Any other points or thoughts?
> >>>
> >>
> >> In the past I've noticed a big impact of GC when doing certain
> >> performance measurements. I assume you are using the same JVM version
> >> and GC when running the two tests? Perhaps our memory footprint has
> >> expanded over time. You should rule out GC by running with GC logging
> >> turned on with both versions and compare the impact.
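> >> A minimal sketch of what I mean, assuming JDK 11+ (your stack traces
> >> mention java.base) and the stock zkServer.sh scripts; the log path and
> >> exact flags are illustrative, adjust for your environment:
> >>
> >>   # JDK 9+ unified GC logging (hypothetical path)
> >>   export SERVER_JVMFLAGS="-Xlog:gc*:file=/tmp/zk-gc.log:time,uptime"
> >>
> >>   # JDK 8 equivalent, if the 3.4 cluster runs on an older JVM
> >>   export SERVER_JVMFLAGS="-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/zk-gc.log"
> >>
> >> Then compare pause times and allocation rates between the two versions.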
> >>
> >> Regards,
> >>
> >> Patrick
> >>
> >>
> >>> Cheers,
> >>>
> >>> Li
> >>>
> >>>
> >>> On Sat, Feb 20, 2021 at 9:04 PM Li Wang <li4w...@gmail.com> wrote:
> >>>
> >>>> Thanks Sushant and Enrico!
> >>>>
> >>>> This is a really good point. According to the 3.6 documentation, the
> >>>> feature is disabled by default:
> >>>> https://zookeeper.apache.org/doc/r3.6.2/zookeeperAdmin.html#ch_administration.
> >>>> However, checking the code, the default is enabled.
> >>>>
> >>>> Let me set zookeeper.digest.enabled to false and see how the write
> >>>> operation performs.
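> >>>> Concretely, something like this, assuming the standard zkServer.sh
> >>>> setup (the property name comes from Sushant's pointer; everything
> >>>> else here is illustrative):
> >>>>
> >>>>   # pass the flag to the server JVM, then restart
> >>>>   export SERVER_JVMFLAGS="-Dzookeeper.digest.enabled=false"
> >>>>   bin/zkServer.sh restart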
> >>>>
> >>>> Best,
> >>>>
> >>>> Li
> >>>>
> >>>>
> >>>> On Fri, Feb 19, 2021 at 1:32 PM Sushant Mane <sushantma...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Li,
> >>>>>
> >>>>> On 3.6.2 the consistency checker (AdHash based) is enabled by default:
> >>>>> https://github.com/apache/zookeeper/blob/803c7f1a12f85978cb049af5e4ef23bd8b688715/zookeeper-server/src/main/java/org/apache/zookeeper/server/ZooKeeperServer.java#L136.
> >>>>> It is not present in ZK 3.4.14.
> >>>>>
> >>>>> This feature does have some impact on write performance.
> >>>>>
> >>>>> Thanks,
> >>>>> Sushant
> >>>>>
> >>>>>
> >>>>> On Fri, Feb 19, 2021 at 12:50 PM Enrico Olivelli <eolive...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Li,
> >>>>>> I wonder if we have some new throttling/back-pressure mechanism
> >>>>>> that is enabled by default.
> >>>>>>
> >>>>>> Does anyone have some pointers to the relevant implementations?
> >>>>>>
> >>>>>>
> >>>>>> Enrico
> >>>>>>
> >>>>>> On Fri, Feb 19, 2021 at 19:46 Li Wang <li4w...@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> We switched to Netty on both the client side and the server side,
> >>>>>>> and the performance issue is still there.
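> >>>>>>> For anyone who wants to reproduce: the Netty stack is selected via
> >>>>>>> the standard system properties (a sketch; set them in the JVM
> >>>>>>> flags on each side):
> >>>>>>>
> >>>>>>>   # server side
> >>>>>>>   -Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
> >>>>>>>   # client side
> >>>>>>>   -Dzookeeper.clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty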
> >>>>>>> Does anyone have any insights into what could be the cause of the
> >>>>>>> higher latency?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Li
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Feb 15, 2021 at 2:17 PM Li Wang <li4w...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi Enrico,
> >>>>>>>>
> >>>>>>>> Thanks for the reply.
> >>>>>>>>
> >>>>>>>> 1. We are using the NIO-based stack, not the Netty-based one yet.
> >>>>>>>>
> >>>>>>>> 2. Yes, here are some metrics on the client side.
> >>>>>>>>
> >>>>>>>> 3.6: throughput: 7K, failures: 81215228, avg latency: 57ms, max
> >>>>>>>> latency: 31s
> >>>>>>>> 3.4: throughput: 15K, failures: 0, avg latency: 30ms, max
> >>>>>>>> latency: 1.6s
> >>>>>>>>
> >>>>>>>> 3. Yes, the JVM and zoo.cfg configs are exactly the same:
> >>>>>>>>
> >>>>>>>> 10G of heap
> >>>>>>>> 13G of memory
> >>>>>>>> 5 participants
> >>>>>>>> 5 observers
> >>>>>>>> Client session timeout: 3000ms
> >>>>>>>> Server min session timeout: 4000ms
> >>>>>>>>
> >>>>>>>> 4. Yes, there are two types of WARN logs and many "Expiring
> >>>>>>>> session" INFO logs:
> >>>>>>>>
> >>>>>>>> 2021-02-15 22:04:36,506 [myid:4] - WARN
> >>>>>>>> [NIOWorkerThread-7:NIOServerCnxn@365] - Unexpected exception
> >>>>>>>> EndOfStreamException: Unable to read additional data from client,
> >>>>>>>> it probably closed the socket: address = /100.108.63.116:43366,
> >>>>>>>> session = 0x400189fee9a000b
> >>>>>>>>   at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:164)
> >>>>>>>>   at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:327)
> >>>>>>>>   at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
> >>>>>>>>   at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
> >>>>>>>>   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> >>>>>>>>   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> >>>>>>>>   at java.base/java.lang.Thread.run(Thread.java:834)
> >>>>>>>>
> >>>>>>>> 2021-02-15 22:05:14,428 [myid:4] - WARN
> >>>>>>>> [SyncThread:4:SyncRequestProcessor@188] - Too busy to snap, skipping
> >>>>>>>>
> >>>>>>>> 2021-02-15 22:01:51,427 [myid:4] - INFO
> >>>>>>>> [SessionTracker:ZooKeeperServer@610] - Expiring session
> >>>>>>>> 0x400189fee9a001e, timeout of 4000ms exceeded
> >>>>>>>>
> >>>>>>>> 5. Yes, we upgraded both the client and the server to 3.6.
> >>>>>>>> Actually, the issue happened with both of these combinations:
> >>>>>>>>
> >>>>>>>> 3.4 client and 3.6 server
> >>>>>>>> 3.6 client and 3.6 server
> >>>>>>>>
> >>>>>>>> Please let me know if you need any additional info.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Li
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Mon, Feb 15, 2021 at 1:44 PM Li Wang <li4w...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Enrico,
> >>>>>>>>>
> >>>>>>>>> Thanks for the reply.
> >>>>>>>>>
> >>>>>>>>> 1. We are using the direct NIO-based stack, not the Netty-based
> >>>>>>>>> one yet.
> >>>>>>>>> 2. Yes, on the client side, here are the metrics
> >>>>>>>>>
> >>>>>>>>> 3.6:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Feb 15, 2021 at 10:44 AM Enrico Olivelli
> >>>>>>>>> <eolive...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> IIRC the main difference is the switch to Netty 4 and the use
> >>>>>>>>>> of more direct memory. Are you using the Netty-based stack?
> >>>>>>>>>>
> >>>>>>>>>> Apart from that macro difference there have been many, many
> >>>>>>>>>> changes since 3.4.
> >>>>>>>>>>
> >>>>>>>>>> Do you have some metrics to share?
> >>>>>>>>>> Are the JVM configuration and the zoo.cfg identical in the two
> >>>>>>>>>> tests?
> >>>>>>>>>>
> >>>>>>>>>> Do you see warnings in the server logs?
> >>>>>>>>>>
> >>>>>>>>>> Did you upgrade both the client and the server, or only the
> >>>>>>>>>> server?
> >>>>>>>>>>
> >>>>>>>>>> Enrico
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Feb 15, 2021 at 18:30 Li Wang <li4w...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> We want to upgrade from 3.4.14 to 3.6.2. During the
> >>>>>>>>>>> performance/load comparison test, we found that write
> >>>>>>>>>>> performance in 3.6 is significantly degraded compared to 3.4.
> >>>>>>>>>>> Under the same load, there was a huge number of SessionExpired
> >>>>>>>>>>> and ConnectionLoss errors in 3.6, while there were no such
> >>>>>>>>>>> errors in 3.4.
> >>>>>>>>>>>
> >>>>>>>>>>> The load test uses 500 concurrent users against a cluster of 5
> >>>>>>>>>>> participants and 5 observers. The min session timeout on the
> >>>>>>>>>>> server side is 4000ms.
> >>>>>>>>>>>
> >>>>>>>>>>> I wonder if anyone has seen the same issue and has any
> >>>>>>>>>>> insights on what could be the cause of the performance
> >>>>>>>>>>> degradation.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks
> >>>>>>>>>>>
> >>>>>>>>>>> Li
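> >>>>>>>>>>> A configuration along these lines would match that setup
> >>>>>>>>>>> (a sketch only; hostnames and server ids are made up, and the
> >>>>>>>>>>> only values taken from above are the session timeout and the
> >>>>>>>>>>> 5+5 participant/observer split):
> >>>>>>>>>>>
> >>>>>>>>>>>   tickTime=2000
> >>>>>>>>>>>   minSessionTimeout=4000
> >>>>>>>>>>>   # participants
> >>>>>>>>>>>   server.1=zk1:2888:3888
> >>>>>>>>>>>   # ... server.2 through server.5 are analogous ...
> >>>>>>>>>>>   # observers (each observer node also sets peerType=observer)
> >>>>>>>>>>>   server.6=zk6:2888:3888:observer
> >>>>>>>>>>>   # ... server.7 through server.10 are analogous ...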