I feel like I'm missing something fairly fundamental here.  I'm
building a clustered application that uses ZooKeeper (3.3.1) to store
its configuration information.  There are 33 nodes in the cluster
(Amazon EC2 instance, if that matters), and I'm currently using a
single ZooKeeper instance. When a node starts up, it makes a
connection to ZK, sets the data on a few paths and makes an ephemeral
node for itself. I keep the connection open while the node is running
so that I can use watches to find out if a node disappears, but after
the initial setup, the node usually won't write or read anything from
ZK.

My understanding (having had a quick look at the code) is that the
client connection will send a ping every sessionTimeout * 2/3 ms or so
to keep the session alive, but I keep seeing sessions dying.  On the
client side I'll see something like the following sequence of events:

[05/21/10 15:59:40.753] INFO Initiating client connection,
connectString=zookeeper:2200 sessionTimeout=30000
watcher=com.echonest.cluster.zoocontai...@1eb3319f
[05/21/10 15:59:40.767] INFO Socket connection established to
zookeeper/10.255.9.187:2200, initiating session
[05/21/10 15:59:40.787] INFO Session establishment complete on server
zookeeper/10.255.9.187:2200, sessionid = 0x128bb7b828d004c, negotiated
timeout = 30000
[05/21/10 16:13:03.729] INFO Client session timed out, have not heard
from server in 33766ms for sessionid 0x128bb7b828d004c, closing socket
connection and attempting reconnect
[05/21/10 16:13:19.268] INFO Initiating client connection,
connectString=zookeeper:2200 sessionTimeout=30000
watcher=com.echonest.cluster.zoocontai...@1eb3319f
[05/21/10 16:14:12.326] INFO Client session timed out, have not heard
from server in 53058ms for sessionid 0x128bb7b828d004c, closing socket
connection and attempting reconnect
[05/21/10 16:14:29.810] INFO Socket connection established to
zookeeper/10.255.9.187:2200, initiating session
[05/21/10 16:14:29.814] INFO Session establishment complete on server
zookeeper/10.255.9.187:2200, sessionid = 0x128bb7b828d004f, negotiated
timeout = 30000

While on the server side, I'll see something like:

2010-05-21 15:59:40,775 - INFO  [SyncThread:0:nioserverc...@1579] -
Established session 0x128bb7b828d004c with negotiated timeout 30000
for client /10.198.98.3:50376
2010-05-21 15:59:40,799 - INFO
[ProcessThread:-1:preprequestproces...@405] - Got user-level
KeeperException when processing sessionid:0x128bb7b828d004c
type:create cxid:0x1 zxid:0xfffffffffffffffe txntype:unknown
reqpath:n/a Error Path:/clusters Error:KeeperErrorCode = NodeExists
for /clusters
2010-05-21 15:59:40,825 - INFO
[ProcessThread:-1:preprequestproces...@405] - Got user-level
KeeperException when processing sessionid:0x128bb7b828d004c
type:delete cxid:0xc zxid:0xfffffffffffffffe txntype:unknown
reqpath:n/a Error
Path:/clusters/echostore/groups/01222100-81db-439e-a3df-90e061166e44/master
Error:KeeperErrorCode = NoNode for
/clusters/echostore/groups/01222100-81db-439e-a3df-90e061166e44/master
2010-05-21 16:13:03,732 - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2200:nioserverc...@633] -
EndOfStreamException: Unable to read additional data from client
sessionid 0x128bb7b828d004c, likely client has closed socket
2010-05-21 16:13:03,733 - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2200:nioserverc...@1434] -
Closed socket connection for client /10.198.98.3:50376 which had
sessionid 0x128bb7b828d004c
2010-05-21 16:13:20,012 - INFO  [SessionTracker:zookeeperser...@315] -
Expiring session 0x128bb7b828d004c, timeout of 30000ms exceeded
2010-05-21 16:13:20,013 - INFO
[ProcessThread:-1:preprequestproces...@385] - Processed session
termination for sessionid: 0x128bb7b828d004c

If I'm reading this correctly, the connection gets set up and then the
server experiences an error trying to read from the client, so it
closes the connection.  It's not clear if this causes the session
timeout or vice-versa (these systems are both running ntp, but I doubt
that we can count on interleaving those log times correctly.)

I started out with a session timeout of 10,000ms, but as you can see,
I have the same problem at 30,000ms.

Do I have a fundamental misunderstanding? What else should I do to
figure out what's going on here?

Steve

Reply via email to