Can you explain why this is posted to the Arrow mailing list? This does not seem relevant to Arrow. If indeed it isn't, please remove the Arrow mailing list from the recipients.
Regards,
Antoine.

On Wed, 21 Apr 2021 11:25:20 +0800, shrikant kalani <shrikantkal...@gmail.com> wrote:

Hello Everyone,

We are also using ZooKeeper 3.6.2 with SSL enabled on both sides. We observed the same behaviour: under high write load the ZK server starts expiring sessions. There are no JVM-related issues, but during high load the max latency increases significantly.

The session expiration message is also not accurate. We have the session expiration set to 40 seconds, but the ZK server disconnects the client within 10 seconds.

The logs also print that requests are being throttled, yet the ZK documentation says throttling is disabled by default. Can someone check the code to see whether it is enabled or disabled? I am not a developer and hence not familiar with the Java code.

Thanks,
Srikant Kalani

On Wed, 21 Apr 2021 at 11:03 AM, Michael Han <hanm-1odqgaof3lkdnm+yrof...@public.gmane.org> wrote:

What does the workload look like? Is it pure write, or mixed read/write?

A couple of ideas to move this forward:
* Publish the performance benchmark so the community can help.
* Bisect the git history to find the bad commit that caused the regression.
* Use the fine-grained metrics introduced in 3.6 (e.g. per-processor-stage metrics) to measure where time is spent during writes. We might have to add these metrics to 3.4 to get a fair comparison.

As for the throttling: the RequestThrottler introduced in 3.6 does introduce latency, but it should not impact throughput this much.

On Thu, Mar 11, 2021 at 11:46 AM Li Wang <li4wang-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:

The CPU usage of both server and client is normal (< 50%) during the test.

Based on the investigation, the server is too busy with the load.

The issue doesn't exist in 3.4.14. I wonder why there is such a significant write performance degradation from 3.4.14 to 3.6.2, and how we can address it.
Best,
Li

On Thu, Mar 11, 2021 at 11:25 AM Andor Molnar <andor-1odqgaof3lkdnm+yrof...@public.gmane.org> wrote:

What is the CPU usage of both server and client during the test?

It looks like the server is dropping the clients because either the server alone or both sides are too busy to deal with the load. This log line is also concerning: "Too busy to snap, skipping"

If that's the case, I believe you'll have to profile the server process to figure out where the perf bottleneck is.

Andor

On 2021. Feb 22., at 5:31, Li Wang <li4wang-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:

Thanks, Patrick.

Yes, we are using the same JVM version and GC configuration when running the two tests. I have checked the GC metrics and also a heap dump of 3.6; the GC pauses and the memory usage look okay.

Best,
Li

On Sun, Feb 21, 2021 at 3:34 PM Patrick Hunt <phunt-1odqgaof3lkdnm+yrof...@public.gmane.org> wrote:

> On Sun, Feb 21, 2021 at 3:28 PM Li Wang <li4wang-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
>
> Hi Enrico, Sushant,
>
> I re-ran the perf test with the data consistency check feature disabled
> (i.e. -Dzookeeper.digest.enabled=false); the write performance issue of
> 3.6 is still there.
>
> With everything else exactly the same, the throughput of 3.6 was only
> half that of 3.4, and the max latency was more than 8 times higher.
>
> Any other points or thoughts?
In the past I've noticed a big impact from GC when doing certain performance measurements. I assume you are using the same JVM version and GC settings when running the two tests? Perhaps our memory footprint has expanded over time. You should rule out GC by running with GC logging turned on for both versions and comparing the impact.

Regards,
Patrick

On Sat, Feb 20, 2021 at 9:04 PM Li Wang <li4wang-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:

Thanks Sushant and Enrico!

This is a really good point. According to the 3.6 documentation, the feature is disabled by default:
https://zookeeper.apache.org/doc/r3.6.2/zookeeperAdmin.html#ch_administration
However, checking the code, it is enabled by default.

Let me set zookeeper.digest.enabled to false and see how the write operations perform.

Best,
Li

On Fri, Feb 19, 2021 at 1:32 PM Sushant Mane <sushantma...@gmail.com> wrote:

Hi Li,

On 3.6.2 the consistency checker (AdHash based) is enabled by default:
https://github.com/apache/zookeeper/blob/803c7f1a12f85978cb049af5e4ef23bd8b688715/zookeeper-server/src/main/java/org/apache/zookeeper/server/ZooKeeperServer.java#L136
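For anyone who wants to repeat this experiment, the property can be set at server startup; a minimal sketch, assuming the stock zkServer.sh/zkEnv.sh scripts (which source conf/zookeeper-env.sh and pass SERVER_JVMFLAGS to the server JVM):

```shell
# conf/zookeeper-env.sh -- sourced by zkEnv.sh at server startup
# Disable the AdHash-based data consistency checker introduced in 3.6
export SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Dzookeeper.digest.enabled=false"
```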
It is not present in ZK 3.4.14.

This feature does have some impact on write performance.

Thanks,
Sushant

On Fri, Feb 19, 2021 at 12:50 PM Enrico Olivelli <eolive...@gmail.com> wrote:

Li,
I wonder if we have some new throttling/back-pressure mechanism that is enabled by default.

Does anyone have a pointer to the relevant implementations?

Enrico

On Fri, 19 Feb 2021 at 19:46, Li Wang <li4wang-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:

Hi,

We switched to Netty on both the client side and the server side, and the performance issue is still there. Does anyone have any insight into what could be causing the higher latency?

Thanks,
Li

On Mon, Feb 15, 2021 at 2:17 PM Li Wang <li4wang-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:

Hi Enrico,

Thanks for the reply.

1. We are using the NIO-based stack, not the Netty-based one yet.

2. Yes, here are some metrics on the client side.
3.6: throughput: 7K, failures: 81215228, avg latency: 57 ms, max latency: 31 s
3.4: throughput: 15K, failures: 0, avg latency: 30 ms, max latency: 1.6 s

3. Yes, the JVM and zoo.cfg configurations are exactly the same:

10 GB of heap
13 GB of memory
5 participants
5 observers
Client session timeout: 3000 ms
Server min session timeout: 4000 ms

4. Yes, there are two types of WARN logs and many "Expiring session" INFO logs:

2021-02-15 22:04:36,506 [myid:4] - WARN [NIOWorkerThread-7:NIOServerCnxn@365] - Unexpected exception

EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /100.108.63.116:43366, session = 0x400189fee9a000b
    at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:164)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:327)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

2021-02-15 22:05:14,428 [myid:4] - WARN [SyncThread:4:SyncRequestProcessor@188] - Too busy to snap, skipping

2021-02-15 22:01:51,427 [myid:4] - INFO [SessionTracker:ZooKeeperServer@610] - Expiring session 0x400189fee9a001e, timeout of 4000ms exceeded

5. Yes, we upgraded both the client and the server to 3.6. Actually, the issue happened with both of these combinations:

3.4 client and 3.6 server
3.6 client and 3.6 server

Please let me know if you need any additional info.

Thanks,
Li

On Mon, Feb 15, 2021 at 1:44 PM Li Wang <li4wang-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:

Hi Enrico,

Thanks for the reply.
1. We are using the direct NIO-based stack, not the Netty-based one yet.
2. Yes, on the client side, here are the metrics

3.6:

On Mon, Feb 15, 2021 at 10:44 AM Enrico Olivelli <eolive...@gmail.com> wrote:

IIRC the main difference is the switch to Netty 4 and the use of more direct memory. Are you using the Netty-based stack?

Apart from that macro difference, there have been many, many changes since 3.4.

Do you have some metrics to share?
Are the JVM configuration and the zoo.cfg configuration equal to each other?

Do you see warnings in the server logs?

Did you upgrade both the client and the server, or only the server?

Enrico

On Mon, 15 Feb 2021 at 18:30, Li Wang <li4wang-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:

Hi,

We want to upgrade from 3.4.14 to 3.6.2. During the performance/load comparison test, we found that the performance of 3.6 has significantly degraded compared to 3.4 for write operations.
Under the same load, there were a huge number of SessionExpired and ConnectionLoss errors in 3.6, while there were no such errors in 3.4.

The load test is 500 concurrent users against a cluster of 5 participants and 5 observers. The min session timeout on the server side is 4000 ms.

I wonder if anyone has seen the same issue and has any insight into what could be the cause of the performance degradation.

Thanks,
Li
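One configuration detail from the thread worth spelling out: the client requested a 3000 ms session timeout while the server's minimum was 4000 ms, so the effective timeout is negotiated upward by the server. A simplified sketch of that clamping rule (illustrative only; the class and method names below are my own, not ZooKeeper's — the real check happens in ZooKeeperServer when a connection request is processed, and the min/max bounds default to 2x and 20x tickTime):

```java
// Sketch of ZooKeeper's session-timeout negotiation: the server clamps
// the client's requested timeout into [minSessionTimeout, maxSessionTimeout].
public class SessionTimeoutSketch {

    static int negotiate(int requestedMs, int minSessionMs, int maxSessionMs) {
        // Requests below the server minimum are raised; requests above
        // the maximum are lowered.
        return Math.max(minSessionMs, Math.min(requestedMs, maxSessionMs));
    }

    public static void main(String[] args) {
        // The thread's setup: client asks for 3000 ms, the server minimum
        // is 4000 ms, so the session actually expires after 4000 ms --
        // consistent with the "timeout of 4000ms exceeded" log line above.
        System.out.println(negotiate(3000, 4000, 40000)); // prints 4000
    }
}
```

This also explains why "Expiring session ... timeout of 4000ms exceeded" appears in the logs even though the client asked for 3000 ms.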