I feel like I'm missing something fairly fundamental here. I'm building a clustered application that uses ZooKeeper (3.3.1) to store its configuration information. There are 33 nodes in the cluster (Amazon EC2 instance, if that matters), and I'm currently using a single ZooKeeper instance. When a node starts up, it makes a connection to ZK, sets the data on a few paths and makes an ephemeral node for itself. I keep the connection open while the node is running so that I can use watches to find out if a node disappears, but after the initial setup, the node usually won't write or read anything from ZK.
My understanding (having had a quick look at the code) is that the client connection will send a ping every sessionTimeout * 2/3 ms or so to keep the session alive, but I keep seeing sessions dying. On the client side I'll see something like the following sequence of events: [05/21/10 15:59:40.753] INFO Initiating client connection, connectString=zookeeper:2200 sessionTimeout=30000 watcher=com.echonest.cluster.zoocontai...@1eb3319f [05/21/10 15:59:40.767] INFO Socket connection established to zookeeper/10.255.9.187:2200, initiating session [05/21/10 15:59:40.787] INFO Session establishment complete on server zookeeper/10.255.9.187:2200, sessionid = 0x128bb7b828d004c, negotiated timeout = 30000 [05/21/10 16:13:03.729] INFO Client session timed out, have not heard from server in 33766ms for sessionid 0x128bb7b828d004c, closing socket connection and attempting reconnect [05/21/10 16:13:19.268] INFO Initiating client connection, connectString=zookeeper:2200 sessionTimeout=30000 watcher=com.echonest.cluster.zoocontai...@1eb3319f [05/21/10 16:14:12.326] INFO Client session timed out, have not heard from server in 53058ms for sessionid 0x128bb7b828d004c, closing socket connection and attempting reconnect [05/21/10 16:14:29.810] INFO Socket connection established to zookeeper/10.255.9.187:2200, initiating session [05/21/10 16:14:29.814] INFO Session establishment complete on server zookeeper/10.255.9.187:2200, sessionid = 0x128bb7b828d004f, negotiated timeout = 30000 While on the server side, I'll see something like: 2010-05-21 15:59:40,775 - INFO [SyncThread:0:nioserverc...@1579] - Established session 0x128bb7b828d004c with negotiated timeout 30000 for client /10.198.98.3:50376 2010-05-21 15:59:40,799 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x128bb7b828d004c type:create cxid:0x1 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/clusters Error:KeeperErrorCode = NodeExists for /clusters 2010-05-21 15:59:40,825 - INFO [ProcessThread:-1:preprequestproces...@405] - Got user-level KeeperException when processing sessionid:0x128bb7b828d004c type:delete cxid:0xc zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/clusters/echostore/groups/01222100-81db-439e-a3df-90e061166e44/master Error:KeeperErrorCode = NoNode for /clusters/echostore/groups/01222100-81db-439e-a3df-90e061166e44/master 2010-05-21 16:13:03,732 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2200:nioserverc...@633] - EndOfStreamException: Unable to read additional data from client sessionid 0x128bb7b828d004c, likely client has closed socket 2010-05-21 16:13:03,733 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2200:nioserverc...@1434] - Closed socket connection for client /10.198.98.3:50376 which had sessionid 0x128bb7b828d004c 2010-05-21 16:13:20,012 - INFO [SessionTracker:zookeeperser...@315] - Expiring session 0x128bb7b828d004c, timeout of 30000ms exceeded 2010-05-21 16:13:20,013 - INFO [ProcessThread:-1:preprequestproces...@385] - Processed session termination for sessionid: 0x128bb7b828d004c If I'm reading this correctly, the connection gets set up and then the server experiences an error trying to read from the client, so it closes the connection. It's not clear if this causes the session timeout or vice-versa (these systems are both running ntp, but I doubt that we can count on interleaving those log times correctly.) I started out with a session timeout of 10,000ms, but as you can see, I have the same problem at 30,000ms. Do I have a fundamental misunderstanding? What else should I do to figure out what's going on here? Steve