Can you explain why this is posted to the Arrow mailing list? This
does not seem relevant to Arrow. If indeed it isn't, please remove the
Arrow mailing list from the recipients.

Regards

Antoine.


On Wed, 21 Apr 2021 11:25:20 +0800
shrikant kalani
<shrikantkal...@gmail.com> wrote:
> Hello Everyone,
> 
> We are also using ZooKeeper 3.6.2, with SSL enabled on both sides. We
> observed the same behaviour: under high write load the ZK server
> starts expiring sessions. There are no JVM-related issues. During high
> load the max latency increases significantly.
> 
> Also, the session expiration message is not accurate. We have the session
> timeout set to 40 sec, but the ZK server disconnects the client within 10 sec.
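> 
> (One thing worth ruling out here: the server clamps every requested session
> timeout into [minSessionTimeout, maxSessionTimeout], so the negotiated value
> can be much smaller than the 40 sec the client asks for. A minimal sketch
> using the public client API, with connectString and watcher as placeholders:)
> 
>     ZooKeeper zk = new ZooKeeper(connectString, 40000, watcher);
>     // once connected, ask what the server actually granted:
>     System.out.println("negotiated timeout: " + zk.getSessionTimeout());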
> 
> Also, the logs print "throttling the request", but the ZK documentation says
> throttling is disabled by default. Can someone check the code to see whether
> it is enabled or disabled? I am not a developer and hence not familiar
> with the Java code.
> 
> Thanks
> Srikant Kalani
> 
> On Wed, 21 Apr 2021 at 11:03 AM, Michael Han 
> <hanm-1odqgaof3lkdnm+yrof...@public.gmane.org> wrote:
> 
> > What is the workload looking like? Is it pure write, or mixed read/write?
> >
> > A couple of ideas to move this forward:
> > * Publish the performance benchmark so the community can help.
> > * Bisect git commit and find the bad commit that caused the regression.
> > * Use the fine-grained metrics introduced in 3.6 (e.g. per-processor-stage
> > metrics) to measure where time is spent during writes. We might have to add
> > these metrics to 3.4 to get a fair comparison. (One way to surface the 3.6
> > metrics is sketched below.)
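> >
> > (On the last point, a sketch using the Prometheus metrics provider that
> > ships with 3.6; zoo.cfg keys per the 3.6 admin docs:)
> >
> >     metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
> >     metricsProvider.httpPort=7000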
> >
> > For the throttling - the RequestThrottler introduced in 3.6 does introduce
> > latency, but should not impact throughput this much.
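> >
> > (For reference, the RequestThrottler is controlled by system properties; a
> > sketch of the 3.6 knobs and what I understand the defaults to be, where 0
> > means unlimited, i.e. request throttling effectively off:)
> >
> >     -Dzookeeper.request_throttle_max_requests=0
> >     -Dzookeeper.request_throttle_stall_time=100
> >     -Dzookeeper.request_throttle_drop_stale=true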
> >
> > On Thu, Mar 11, 2021 at 11:46 AM Li Wang 
> > <li4wang-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> >  
> > > The CPU usage of both server and client is normal (< 50%) during the test.
> > >
> > > Based on the investigation, the server is too busy with the load.
> > >
> > > The issue doesn't exist in 3.4.14. I wonder why there is a significant
> > > write performance degradation from 3.4.14 to 3.6.2 and how we can address
> > > the issue.
> > >
> > > Best,
> > >
> > > Li
> > >
> > > On Thu, Mar 11, 2021 at 11:25 AM Andor Molnar 
> > > <andor-1odqgaof3lkdnm+yrof...@public.gmane.org> wrote:
> > >  
> > > > What is the CPU usage of both server and client during the test?
> > > >
> > > > Looks like the server is dropping the clients because either the server
> > > > or the clients (or both) are too busy to deal with the load.
> > > > This log line is also concerning: "Too busy to snap, skipping"
> > > >
> > > > If that's the case, I believe you'll have to profile the server process
> > > > to figure out where the perf bottleneck is.
> > > >
> > > > Andor
> > > >
> > > >
> > > >
> > > >  
> > > > > On 2021. Feb 22., at 5:31, Li Wang 
> > > > > <li4wang-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > > > >
> > > > > Thanks, Patrick.
> > > > >
> > > > > Yes, we are using the same JVM version and GC configurations when
> > > > > running the two tests. I have checked the GC metrics and also the heap
> > > > > dump of the 3.6 server; the GC pauses and the memory usage look okay.
> > > > >
> > > > > Best,
> > > > >
> > > > > Li
> > > > >
> > > > > On Sun, Feb 21, 2021 at 3:34 PM Patrick Hunt 
> > > > > <phunt-1odqgaof3lkdnm+yrof...@public.gmane.org> wrote:
> > > > >  
> > > > >> On Sun, Feb 21, 2021 at 3:28 PM Li Wang 
> > > > >> <li4wang-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > > > >>  
> > > > >>> Hi Enrico, Sushant,
> > > > >>>
> > > > >>> I re-ran the perf test with the data consistency check feature
> > > > >>> disabled (i.e. -Dzookeeper.digest.enabled=false); the write
> > > > >>> performance issue of 3.6 is still there.
> > > > >>>
> > > > >>> With everything exactly the same, the throughput of 3.6 was only 1/2
> > > > >>> that of 3.4, and the max latency was more than 8 times higher.
> > > > >>>
> > > > >>> Any other points or thoughts?
> > > > >>>
> > > > >>>  
> > > > >> In the past I've noticed a big impact of GC when doing certain
> > > > >> performance measurements. I assume you are using the same JVM version
> > > > >> and GC settings when running the two tests? Perhaps our memory
> > > > >> footprint has expanded over time. You should rule out GC by running
> > > > >> with GC logging turned on in both versions and comparing the impact.
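> > > > >>
> > > > >> (A minimal sketch of GC logging flags; the exact names depend on the
> > > > >> JDK, since unified logging replaced the old flags in JDK 9:)
> > > > >>
> > > > >>     # JDK 8
> > > > >>     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log
> > > > >>     # JDK 9 and later
> > > > >>     -Xlog:gc*:file=gc.log:time,uptime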
> > > > >>
> > > > >> Regards,
> > > > >>
> > > > >> Patrick
> > > > >>
> > > > >>  
> > > > >>> Cheers,
> > > > >>>
> > > > >>> Li
> > > > >>>
> > > > >>> On Sat, Feb 20, 2021 at 9:04 PM Li Wang 
> > > > >>> <li4wang-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > > > >>>  
> > > > >>>> Thanks Sushant and Enrico!
> > > > >>>>
> > > > >>>> This is a really good point. According to the 3.6 documentation,
> > > > >>>> the feature is disabled by default:
> > > > >>>> https://zookeeper.apache.org/doc/r3.6.2/zookeeperAdmin.html#ch_administration
> > > > >>>> However, checking the code, the default is enabled.
> > > > >>>>
> > > > >>>> Let me set zookeeper.digest.enabled to false and see how the write
> > > > >>>> operation performs.
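> > > > >>>>
> > > > >>>> (For reference, a paraphrased sketch, not the verbatim source, of
> > > > >>>> what the 3.6 server does at startup, which is why the effective
> > > > >>>> default is "enabled":)
> > > > >>>>
> > > > >>>>     // paraphrased from ZooKeeperServer.java in 3.6 (see the link
> > > > >>>>     // Sushant posted below); note the "true" fallback:
> > > > >>>>     digestEnabled = Boolean.parseBoolean(
> > > > >>>>         System.getProperty("zookeeper.digest.enabled", "true"));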
> > > > >>>>
> > > > >>>> Best,
> > > > >>>>
> > > > >>>> Li
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> On Fri, Feb 19, 2021 at 1:32 PM Sushant Mane
> > > > >>>> <sushantma...@gmail.com> wrote:
> > > > >>>>  
> > > > >>>>> Hi Li,
> > > > >>>>>
> > > > >>>>> On 3.6.2 the consistency checker (AdHash-based) is enabled by
> > > > >>>>> default:
> > > > >>>>> https://github.com/apache/zookeeper/blob/803c7f1a12f85978cb049af5e4ef23bd8b688715/zookeeper-server/src/main/java/org/apache/zookeeper/server/ZooKeeperServer.java#L136
> > > > >>>>> It is not present in ZK 3.4.14.
> > > > >>>>>
> > > > >>>>> This feature does have some impact on write performance.
> > > > >>>>>
> > > > >>>>> Thanks,
> > > > >>>>> Sushant
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On Fri, Feb 19, 2021 at 12:50 PM Enrico Olivelli
> > > > >>>>> <eolive...@gmail.com> wrote:
> > > > >>>>>  
> > > > >>>>>> Li,
> > > > >>>>>> I wonder if we have some new throttling/back-pressure mechanism
> > > > >>>>>> that is enabled by default.
> > > > >>>>>>
> > > > >>>>>> Does anyone have a pointer to the relevant implementations?
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> Enrico
> > > > >>>>>>
> > > > >>>>>> On Fri, 19 Feb 2021 at 19:46, Li Wang
> > > > >>>>>> <li4wang-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > > > >>>>>>  
> > > > >>>>>>> Hi,
> > > > >>>>>>>
> > > > >>>>>>> We switched to Netty on both the client side and the server side,
> > > > >>>>>>> and the performance issue is still there. Does anyone have any
> > > > >>>>>>> insights on what could be the cause of the higher latency?
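> > > > >>>>>>>
> > > > >>>>>>> (For context, switching stacks is normally done via the standard
> > > > >>>>>>> system properties; a sketch, assuming the stock class names from
> > > > >>>>>>> the ZK docs:)
> > > > >>>>>>>
> > > > >>>>>>>     # server
> > > > >>>>>>>     -Dzookeeper.serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
> > > > >>>>>>>     # client
> > > > >>>>>>>     -Dzookeeper.clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty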
> > > > >>>>>>>
> > > > >>>>>>> Thanks,
> > > > >>>>>>>
> > > > >>>>>>> Li
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> On Mon, Feb 15, 2021 at 2:17 PM Li Wang 
> > > > >>>>>>> <li4wang-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > > > >>>>>>>  
> > > > >>>>>>>> Hi Enrico,
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> Thanks for the reply.
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> 1. We are using the NIO-based stack, not the Netty-based one yet.
> > > > >>>>>>>>
> > > > >>>>>>>> 2. Yes, here are some metrics on the client side.
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> 3.6: throughput: 7k, failures: 81215228, avg latency: 57 ms,
> > > > >>>>>>>> max latency: 31 s
> > > > >>>>>>>>
> > > > >>>>>>>> 3.4: throughput: 15k, failures: 0, avg latency: 30 ms,
> > > > >>>>>>>> max latency: 1.6 s
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> 3. Yes, the JVM and zoo.cfg configs are exactly the same:
> > > > >>>>>>>>
> > > > >>>>>>>> 10G of Heap
> > > > >>>>>>>>
> > > > >>>>>>>> 13G of Memory
> > > > >>>>>>>>
> > > > >>>>>>>> 5 Participants
> > > > >>>>>>>>
> > > > >>>>>>>> 5 Observers
> > > > >>>>>>>>
> > > > >>>>>>>> Client session timeout: 3000ms
> > > > >>>>>>>>
> > > > >>>>>>>> Server min session timeout: 4000ms
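> > > > >>>>>>>>
> > > > >>>>>>>> (Note: because the requested 3000 ms is below the server's
> > > > >>>>>>>> minSessionTimeout, the server clamps the negotiated timeout up
> > > > >>>>>>>> to 4000 ms, which matches the "timeout of 4000ms exceeded"
> > > > >>>>>>>> expiry messages below. A sketch with a hypothetical connect
> > > > >>>>>>>> string:)
> > > > >>>>>>>>
> > > > >>>>>>>>     ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181", 3000, event -> { });
> > > > >>>>>>>>     // after the session is established:
> > > > >>>>>>>>     int negotiated = zk.getSessionTimeout(); // 4000, not 3000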
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> 4. Yes, there are two types of WARN logs and many "Expiring
> > > > >>>>>>>> session" INFO logs:
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> 2021-02-15 22:04:36,506 [myid:4] - WARN
> > > > >>>>>>>> [NIOWorkerThread-7:NIOServerCnxn@365] - Unexpected exception
> > > > >>>>>>>> EndOfStreamException: Unable to read additional data from client,
> > > > >>>>>>>> it probably closed the socket: address = /100.108.63.116:43366,
> > > > >>>>>>>> session = 0x400189fee9a000b
> > > > >>>>>>>>   at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:164)
> > > > >>>>>>>>   at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:327)
> > > > >>>>>>>>   at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
> > > > >>>>>>>>   at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
> > > > >>>>>>>>   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> > > > >>>>>>>>   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> > > > >>>>>>>>   at java.base/java.lang.Thread.run(Thread.java:834)
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> 2021-02-15 22:05:14,428 [myid:4] - WARN
> > > > >>>>>>>> [SyncThread:4:SyncRequestProcessor@188] - Too busy to snap, skipping
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> 2021-02-15 22:01:51,427 [myid:4] - INFO
> > > > >>>>>>>> [SessionTracker:ZooKeeperServer@610] - Expiring session
> > > > >>>>>>>> 0x400189fee9a001e, timeout of 4000ms exceeded
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> 5. Yes, we upgraded both the client and the server to 3.6.
> > > > >>>>>>>> Actually, the issue happened with both of the following
> > > > >>>>>>>> combinations:
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> 3.4 client and 3.6 server
> > > > >>>>>>>>
> > > > >>>>>>>> 3.6 client and 3.6 server
> > > > >>>>>>>>
> > > > >>>>>>>> Please let me know if you need any additional info.
> > > > >>>>>>>>
> > > > >>>>>>>> Thanks,
> > > > >>>>>>>>
> > > > >>>>>>>> Li
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> On Mon, Feb 15, 2021 at 1:44 PM Li Wang 
> > > > >>>>>>>> <li4wang-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > > > >>>>>>>>  
> > > > >>>>>>>>> Hi Enrico,
> > > > >>>>>>>>>
> > > > >>>>>>>>> Thanks for the reply.
> > > > >>>>>>>>>
> > > > >>>>>>>>> 1. We are using the direct NIO-based stack, not the Netty-based
> > > > >>>>>>>>> one yet.
> > > > >>>>>>>>> 2. Yes, on the client side, here are the metrics
> > > > >>>>>>>>>
> > > > >>>>>>>>> 3.6:
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> On Mon, Feb 15, 2021 at 10:44 AM Enrico Olivelli
> > > > >>>>>>>>> <eolive...@gmail.com> wrote:
> > > > >>>>>>>>>  
> > > > >>>>>>>>>> IIRC the main differences are the switch to Netty 4 and the
> > > > >>>>>>>>>> use of more direct memory. Are you using the Netty-based stack?
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Apart from that macro difference, there have been many, many
> > > > >>>>>>>>>> changes since 3.4.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Do you have some metrics to share?
> > > > >>>>>>>>>> Are the JVM configurations and zoo.cfg configurations equal to
> > > > >>>>>>>>>> each other?
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Do you see warnings on the server logs?
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Did you upgrade both the client and the server, or only the
> > > > >>>>>>>>>> server?
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Enrico
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Mon, 15 Feb 2021 at 18:30, Li Wang
> > > > >>>>>>>>>> <li4wang-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
> > > > >>>>>>>>>>  
> > > > >>>>>>>>>>> Hi,
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> We want to upgrade from 3.4.14 to 3.6.2. During the
> > > > >>>>>>>>>>> performance/load comparison test, it was found that the
> > > > >>>>>>>>>>> performance of 3.6 has significantly degraded compared to 3.4
> > > > >>>>>>>>>>> for the write operation. Under the same load, there was a
> > > > >>>>>>>>>>> huge number of SessionExpired and ConnectionLoss errors in
> > > > >>>>>>>>>>> 3.6, while there were no such errors in 3.4.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> The load test is 500 concurrent users against a cluster of 5
> > > > >>>>>>>>>>> participants and 5 observers. The min session timeout on the
> > > > >>>>>>>>>>> server side is 4000ms.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> I wonder if anyone has seen the same issue and has any
> > > > >>>>>>>>>>> insights on what could be the cause of the performance
> > > > >>>>>>>>>>> degradation.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Thanks
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Li
> > > > >>>>>>>>>>>  
> > > > >>>>>>>>>>  
> > > > >>>>>>>>>  
> > > > >>>>>>>  
> > > > >>>>>>  
> > > > >>>>>  
> > > > >>>>  
> > > > >>>  
> > > > >>  
> > > >
> > > >  
> > >  
> >  
> 


