Re: Ping and client session timeouts

Patrick Hunt Fri, 21 May 2010 11:16:16 -0700

Hi Stephen, my comments inline below:

On 05/21/2010 09:31 AM, Stephen Green wrote:

I feel like I'm missing something fairly fundamental here.  I'm
building a clustered application that uses ZooKeeper (3.3.1) to store
its configuration information.  There are 33 nodes in the cluster
(Amazon EC2 instance, if that matters), and I'm currently using a
single ZooKeeper instance. When a node starts up, it makes a
connection to ZK, sets the data on a few paths and makes an ephemeral
node for itself. I keep the connection open while the node is running
so that I can use watches to find out if a node disappears, but after
the initial setup, the node usually won't write or read anything from
ZK.


My understanding (having had a quick look at the code) is that the
client connection will send a ping every sessionTimeout * 2/3 ms or so
to keep the session alive, but I keep seeing sessions dying.  On the

Actually the client sends a ping every 1/3 the timeout, and then looksfor a response before another 1/3 elapses. This allows time to reconnectto a different server (and still maintain the session) if the currentserver were to become unavailable.

client side I'll see something like the following sequence of events:

[05/21/10 15:59:40.753] INFO Initiating client connection,
connectString=zookeeper:2200 sessionTimeout=30000
watcher=com.echonest.cluster.zoocontai...@1eb3319f
[05/21/10 15:59:40.767] INFO Socket connection established to
zookeeper/10.255.9.187:2200, initiating session
[05/21/10 15:59:40.787] INFO Session establishment complete on server
zookeeper/10.255.9.187:2200, sessionid = 0x128bb7b828d004c, negotiated
timeout = 30000


Ok, this (^^^) says that the timeout is set to 30sec.

[05/21/10 16:13:03.729] INFO Client session timed out, have not heard
from server in 33766ms for sessionid 0x128bb7b828d004c, closing socket
connection and attempting reconnect
[05/21/10 16:13:19.268] INFO Initiating client connection,
connectString=zookeeper:2200 sessionTimeout=30000
watcher=com.echonest.cluster.zoocontai...@1eb3319f
[05/21/10 16:14:12.326] INFO Client session timed out, have not heard
from server in 53058ms for sessionid 0x128bb7b828d004c, closing socket
connection and attempting reconnect

This (^^^) is very suspicious, in particular "have not heard from serverin 53058ms". This means that the client heartbeat code didn't noticethat the heartbeat was exceeded for 53 seconds! This should neverhappen, the client does a "select" with a timeout of 1/3 the sessiontimeout (10sec here).

The fact that the select is taking 43 addl seconds (53-10sec selecttimeout) tells me that your client jvm is not allowing the heartbeatthread to run. The most common reason for this is GC. Is your clientapplication very memory intensive? Heavy on GC? You should turn on yourGC logging and review the output after reproducing this issue (turningon CMS/incremental GC mode usually resolves this issue, but you shouldverify first). What we typically see here is that the client JVM isrunning GC for very long periods of time, this blocks all the threads,and as a result the heartbeat is not sent by the client!

As you are running in a virtualized environment this could also be afactor (it's def an issue from a GC perspective). But I suspect that gcis the issue here, look at that first.

See this page for some common issues users have faced in the past:http://bit.ly/5WwS44


If I'm reading this correctly, the connection gets set up and then the
server experiences an error trying to read from the client, so it
closes the connection.  It's not clear if this causes the session
timeout or vice-versa (these systems are both running ntp, but I doubt
that we can count on interleaving those log times correctly.)

Yes, this is due to the client not getting the heartbeat, so it willclose the connection and attempt to reestablish.

I started out with a session timeout of 10,000ms, but as you can see,
I have the same problem at 30,000ms.


You may not need to use this after you resolve the GC issue.

Do I have a fundamental misunderstanding? What else should I do to
figure out what's going on here?

As I suggested above, give GC logging a try. I found 'gchisto' a veryuseful tool for reviewing the resulting log files.

http://sysadminsjourney.com/content/2008/09/15/profile-your-java-gc-performance-gchisto

Regards,

Patrick

Re: Ping and client session timeouts

Reply via email to