Yike Xiao created ZOOKEEPER-4741:
------------------------------------
Summary: High latency under heavy load when prometheus metrics
enabled
Key: ZOOKEEPER-4741
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4741
Project: ZooKeeper
Issue Type: Bug
Components: server
Affects Versions: 3.8.2, 3.6.4
Environment: zookeeper version: 3.6.4
kernel: 3.10.0-1160.95.1.el7.x86_64
java version "1.8.0_111"
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
Reporter: Yike Xiao
Attachments: 32010.threaddump.001.txt, 32010.wallclock.profile.html,
image-2023-09-11-16-17-21-166.png
In our production, we use zookeeper built-in PrometheusMetricsProvider to
monitor zookeeper status, recently we observed very high latency in one of our
zookeeper cluster which serve heavy load.
Measured in a heavy client side, the latency could more than 25 seconds.
!image-2023-09-11-16-17-21-166.png! Observed many connections with high Recv-Q
on the server side.
CommitProcWorkThread *BLOCKED* in
{{{}org.apache.zookeeper.server.ServerStats#updateLatency:{}}}{{{}{}}}
"CommitProcWorkThread-15" #21595 daemon prio=5 os_prio=0 tid=0x00007f86d804a000
nid=0x6bca waiting for monitor entry [0x00007f86deb95000]
java.lang.Thread.State: BLOCKED (on object monitor) at
io.prometheus.client.CKMSQuantiles.insert(CKMSQuantiles.java:91) -
waiting to lock <0x0000000784dd1a18> (a io.prometheus.client.CKMSQuantiles)
at
io.prometheus.client.TimeWindowQuantiles.insert(TimeWindowQuantiles.java:38)
at io.prometheus.client.Summary$Child.observe(Summary.java:281) at
io.prometheus.client.Summary.observe(Summary.java:307) at
org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider$PrometheusSummary.add(PrometheusMetricsProvider.java:355)
at
org.apache.zookeeper.server.ServerStats.updateLatency(ServerStats.java:153)
at
org.apache.zookeeper.server.FinalRequestProcessor.updateStats(FinalRequestProcessor.java:669)
at
org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:585)
at
org.apache.zookeeper.server.quorum.CommitProcessor$CommitWorkRequest.doWork(CommitProcessor.java:545)
at
org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The wall lock profile shows that there is lock contention within the
{{CommitProcWorkThread}} threads.
!https://gitlab.dev.zhaopin.com/sucheng.wang/notes/uploads/b9da2552d6b00c3f9130d87caf01325e/image.png!
{{}}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)