Viktor Feklin created ZOOKEEPER-4613:
----------------------------------------
Summary: Zookeper client suddenly stops sending ping requests in
idle state
Key: ZOOKEEPER-4613
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4613
Project: ZooKeeper
Issue Type: Bug
Components: java client
Affects Versions: 3.5.7
Reporter: Viktor Feklin
Attachments: clint-lost-ping.log
Expected behavior: from documentation: "The session is kept alive by requests
sent by the client. If the session is idle for a period of time that would
timeout the session, the client will send a PING request to keep the session
alive."
Actual behavior: periodically client not sending any ping after last
successfuly request.
Client operates as expected (handling server notifications and sending pings
during idle periods). But after handle another notification it suddenly stops
sending any ping (no matter how long idle interval last). Client sleeps until
external events is happend (server notification, or program call).
After wakeup client starts to check session timeouts, detects long pause and
triggers connection to state SUSPENDED. Client "closing socket connection and
attempting reconnect". If period of inactivity was smaller than the session
timeout (session is alive) connection triggers to state RECONNECTED, otherwise
new seession created and connection goes to state LOST and then to state
RECONNECTED.
In the attached log:
1) We have multiple clients working in one JVM.
2) There is a number of zookeeper servers in a cluster.
3) Zookeeper client with seession 0x3008266486f2172 connected to
10.36.193.111:2181 with negotiated timeout = 30_000
4) Client handles last notification at:
{code:java}
15:19:48,874 (mt-[15H15M11S3617]-[module-3]-SendThread(10.36.193.111:2181))
DEBUG org.apache.zookeeper.ClientCnxn$SendThread:923 - Reading reply
sessionid:0x3008266486f2172, packet::
clientPath:/tc/default/supervisors/wf_1057
serverPath:/tc/default/supervisors/wf_1057 finished:false header:: 6131,4
replyHeader:: 6131,42964550082,0 request:: '/tc/default/supervisors/wf_1057,T
response::
#***,s{42964550082,42964550082,1663244388854,1663244388854,0,0,0,144258549042716267,897,0,42964550082}{code}
5) Client shows no activity from this moment until watcher notification from
server (client wakeups and detects session timeout):
{code:java}
15:20:08,884 INFO [stdout]
(mt-[15H15M11S3617]-[module-3]-SendThread(10.36.193.111:2181)) 15:20:08 WARN
org.apache.zookeeper.ClientCnxn$SendThread:1190 - Client session timed out,
have not heard from server in 20010ms for sessionid 0x3008266486f2172
15:20:08,885 INFO [stdout]
(mt-[15H15M11S3617]-[module-3]-SendThread(10.36.193.111:2181)) 15:20:08 INFO
org.apache.zookeeper.ClientCnxn$SendThread:1238 - Client session timed out,
have not heard from server in 20010ms for sessionid 0x3008266486f2172, closing
socket connection and attempting reconnect{code}
6) Ping period = 10 seconds (actualy there were no pings in 20 seconds).
7) Annother clients (running in the same jvm and connected to the same zk
server) operate normaly while broken client do nothing.
{code:java}
15:19:58,885 INFO [stdout]
(mt-[15H15M11S3617]-[module-5]-SendThread(10.36.193.111:2181)) 15:19:58 DEBUG
org.apache.zookeeper.ClientCnxn$SendThread:821 - Got ping response for
sessionid: 0x3008266486f216b after 0ms
15:19:58,975 INFO [stdout]
(mt-[15H15M11S3617]-[module-4]-SendThread(10.36.193.111:2181)) 15:19:58 DEBUG
org.apache.zookeeper.ClientCnxn$SendThread:821 - Got ping response for
sessionid: 0x3008266486f2171 after 1ms{code}
So we can claim:
* there is no long GC pauses on client (we also has GC log for period of
incident without any issues)
* there is no problem with zk server (no reboots etc, cause it successfuly
handles another clints requests)
We continuously see this behavior after zk version upgrade:
zk client 3.4.8 -> 3.5.7 (curator 2.12.0 -> 4.3.0)
zk server upgrade to 3.5.8 (runing with default configuration)
We have state recovery after session exiration - so connection loss trigger
nodes recreation. Another clients react to add/remove events (doing some
computations). Having thousands of clients repeadly reconnecting - we get
anomaly work load even in idle state...
--
This message was sent by Atlassian Jira
(v8.20.10#820010)