Another couple of things I found:
*A couple of Zookeeper client threads are stuck at these stacktraces for
~30 seconds*
"pool-2-thread-1-EventThread" #1218 daemon prio=5 os_prio=0
tid=0x00007ff3f5e23800 nid=0x5cd8 waiting on condition [0x00007ff3ef803000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000018d6d8ed8> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:504)
"pool-2-thread-1-SendThread(72.55.136.25:2181)" #1217 daemon prio=5
os_prio=0 tid=0x00007ff3f5e23000 nid=0x5cd7 runnable [0x00007ff3ef904000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x000000018d68a730> (a sun.nio.ch.Util$3)
- locked <0x000000018d68a720> (a java.util.Collections$UnmodifiableSet)
- locked <0x000000018d68a258> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:349)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145)
*Running the mntr command returned the following stats*
zk_version 3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03, built on
06/29/2018 04:05 GMT
zk_avg_latency 0
zk_max_latency *17657*
zk_min_latency 0
zk_packets_received 1427134
zk_packets_sent 1596974
zk_num_alive_connections 64
zk_outstanding_requests 0
zk_server_state follower
zk_znode_count 1394
zk_watch_count 592
zk_ephemerals_count 192
zk_approximate_data_size 181257
zk_open_file_descriptor_count 94
zk_max_file_descriptor_count 1048576
zk_fsync_threshold_exceed_count 1
I find the *zk_max_latency* extremely hight. I'm wondering what kind of
latency is that ? How can I debug the reason for this value ?
Thanks,
Gelbana
On Wed, Apr 3, 2019 at 1:42 PM Muhammad Gelbana <[email protected]> wrote:
> I'm trying to debug a problem where our client application suddenly loses
> its Zookeeper session. I concluded that by looking at the Zookeeper server
> logs.
>
> I increased the logging details for the client and found the following log
> messages
>
>> DEBUG: [07:33:33] [demo | HA | Manager] Got ping response for sessionid:
>> 0x3000da76fa904b6 after 0ms
>> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
>> DEBUG: [07:34:07] [demo | HA | Manager] Got ping response for sessionid:
>> 0x3000da76fa904b6 after 0ms
>> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
>> DEBUG: [07:34:40] [demo | HA | Manager] Got ping response for sessionid:
>> 0x3000da76fa904b6 after 0ms
>> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
>> DEBUG: [07:35:13] [demo | HA | Manager] Got ping response for sessionid:
>> 0x3000da76fa904b6 after 0ms
>> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
>> DEBUG: [07:35:47] [demo | HA | Manager] Got ping response for sessionid:
>> 0x3000da76fa904b6 after 0ms
>> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
>> DEBUG: [07:36:20] [demo | HA | Manager] Got ping response for sessionid:
>> 0x3000da76fa904b6 after 0ms
>> [org.apache.zookeeper.ClientCnxn$SendThread.readResponse]
>>
>
> I noticed that the duration between each log message is ~33 seconds while
> on another environment (my laptop), the duration goes down to ~1 second.
> What could be causing this huge difference ? I doubt that whatever is
> causing this effect causes the delay to increase significantly at some
> point to the extend that makes my client lose its session.
>