[ https://issues.apache.org/jira/browse/ZOOKEEPER-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yike Xiao updated ZOOKEEPER-4741:
---------------------------------
    Affects Version/s: 3.6.2
                       3.6.1
                       3.6.3
                       3.6.0
                           (was: 3.8.2)

> High latency under heavy load when prometheus metrics enabled
> -------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4741
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4741
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.6.0, 3.6.3, 3.6.1, 3.6.2, 3.6.4
>         Environment: zookeeper version: 3.6.4
> kernel: 3.10.0-1160.95.1.el7.x86_64
> java version "1.8.0_111"
>
> metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
>            Reporter: Yike Xiao
>            Priority: Major
>         Attachments: 32010.lock.profile.html, 32010.threaddump.001.txt, 32010.wallclock.profile.html, image-2023-09-11-16-17-21-166.png
>
>
> In production we use ZooKeeper's built-in PrometheusMetricsProvider to monitor ZooKeeper status. Recently we observed very high latency in one of our ZooKeeper clusters, which serves a heavy load.
> Measured on a heavily loaded client, request latency could exceed 25 seconds:
> {noformat}
> [arthas@8]$ watch org.apache.zookeeper.ClientCnxn submitRequest
> Press Q or Ctrl+C to abort.
> Affect(class count: 1 , method count: 1) cost in 294 ms, listenerId: 1
> method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit
> ts=2023-09-06 16:01:32; [cost=28180.767833ms] result=@ArrayList[
>     @Object[][isEmpty=false;size=4],
>     @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/<IP>:40942 remoteserver:<IP>/<IP>:2181 lastZxid:146177807315 xid:589796 sent:589796 recv:973892 queuedpkts:0 pendingresp:4398 queuedevents:0],
>     @ReplyHeader[585397,146177807315,0],
> ]
> method=org.apache.zookeeper.ClientCnxn.submitRequest location=AtExit
> ts=2023-09-06 16:01:34; [cost=29052.726493ms] result=@ArrayList[
>     @Object[][isEmpty=false;size=4],
>     @ClientCnxn[sessionid:0x221f087c3d7eb6c local:/<IP>:40942 remoteserver:<IP>/<IP>:2181 lastZxid:146177807862 xid:589989 sent:589989 recv:974165 queuedpkts:0 pendingresp:4400 queuedevents:0],
>     @ReplyHeader[585588,146177807862,0]
> {noformat}
>
> We observed many connections with *high Recv-Q* on the server side.
> CommitProcWorkThread threads are *BLOCKED* in {{org.apache.zookeeper.server.ServerStats#updateLatency}}:
> {noformat}
> "CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 tid=0x00007f86d804a000 nid=0x6bca waiting for monitor entry [0x00007f86deb95000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
> 	at io.prometheus.client.CKMSQuantiles.insert(CKMSQuantiles.java:91)
> 	- waiting to lock <0x0000000784dd1a18> (a io.prometheus.client.CKMSQuantiles)
> 	at io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:38)
> 	at io.prometheus.client.Summary$Child.observe(Summary.java:281)
> 	at io.prometheus.client.Summary.observe(Summary.java:307)
> 	at org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider$PrometheusSummary.add(PrometheusMetricsProvider.java:355)
> 	at org.apache.zookeeper.server.ServerStats.updateLatency(ServerStats.java:153)
> 	at org.apache.zookeeper.server.FinalRequestProcessor.updateStats(FinalRequestProcessor.java:669)
> 	at
org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:585)
> 	at org.apache.zookeeper.server.quorum.CommitProcessor$CommitWorkRequest.doWork(CommitProcessor.java:545)
> 	at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
>
>    Locked ownable synchronizers:
> 	- <0x0000000734923e80> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> {noformat}
>
> The wall-clock profile shows lock contention among the {{CommitProcWorkThread}} threads.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
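The thread dump above shows every CommitProcWorkThread funnelling its latency observation through one synchronized CKMSQuantiles instance inside the Prometheus Summary, so the worker pool makes progress one sample at a time. A minimal standalone sketch of that serialization pattern (hypothetical class and method names; this is not ZooKeeper or client_java code, just an illustration of the shared-monitor bottleneck):

```java
// Sketch: N worker threads push every sample through one synchronized
// sink, mimicking how all CommitProcWorkThreads serialize on the single
// CKMSQuantiles monitor inside a Prometheus Summary child.
public class ContendedSummarySketch {

    // Stand-in for the quantile sketch: insert() is guarded by one
    // monitor shared by every observing thread.
    static final class QuantileSink {
        private long samples = 0;

        synchronized void insert(double v) {
            samples++; // the real code inserts into a CKMS quantile sketch
        }

        synchronized long count() {
            return samples;
        }
    }

    static long run() {
        QuantileSink sink = new QuantileSink();
        int workers = 8, perWorker = 10_000;
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> {
                // Every latency observation takes the same lock, so the
                // 8 workers proceed through this loop one at a time.
                for (int j = 0; j < perWorker; j++) {
                    sink.insert(j);
                }
            }, "CommitProcWorkThread-" + i);
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return sink.count();
    }

    public static void main(String[] args) {
        System.out.println(run()); // 80000: all samples passed one monitor
    }
}
```

A thread dump taken while this sketch runs shows most workers BLOCKED waiting for the sink's monitor, which is the same state the CommitProcWorkThreads are in above: the request path stalls not on I/O but on metrics bookkeeping.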