[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195252#comment-13195252 ]
Benjamin Reed commented on ZOOKEEPER-1367:
------------------------------------------

pat and i just finished up a debug session (using his super cool snapshot dump utility :) we found it (actually i see camille is also there too :)

the problem is that the create session happens while server 1 is not connected, but server 1 is not far behind server 2, so when server 1 connects to server 2, server 2 sends it a diff. this diff gets processed in Learner.syncWithLeader(). in that method, as it processes the committed transactions of the diff, it applies them to the dataTree. but the dataTree does not update the session map! thus in FinalRequestProcessor we have:

{code}
rc = zks.getZKDatabase().processTxn(request.getHdr(), request.getTxn());
if (request.type == OpCode.createSession) {
    if (request.getTxn() instanceof CreateSessionTxn) {
        CreateSessionTxn cst = (CreateSessionTxn) request.getTxn();
        zks.sessionTracker.addSession(request.sessionId, cst.getTimeOut());
    } else {
        LOG.warn("*****>>>>> Got " + request.getTxn().getClass() + " "
                + request.getTxn().toString());
    }
} else if (request.type == OpCode.closeSession) {
    zks.sessionTracker.removeSession(request.sessionId);
}
{code}

because we don't have that special handling for session ops in syncWithLeader(), server 1 doesn't track them, so when it becomes the leader the session is missing (a sketch of what such handling could look like follows the issue description below). the fix is rather simple, as is the test case. i'll have both up tonight. thanx everyone for the help in debugging. and jeremy, thanx for providing such great logs and traces and recreating the problem!

> Data inconsistencies and unexpired ephemeral nodes after cluster restart
> ------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1367
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.2
>         Environment: Debian Squeeze, 64-bit
>            Reporter: Jeremy Stribling
>            Priority: Blocker
>             Fix For: 3.4.3
>
>         Attachments: ZOOKEEPER-1367.tgz
>
>
> In one of our tests, we have a cluster of three ZooKeeper servers. We kill
> all three, and then restart just two of them. Sometimes we notice that on
> one of the restarted servers, ephemeral nodes from previous sessions do not
> get deleted, while on the other server they do. We are effectively running
> 3.4.2, though technically we are running 3.4.1 with the patch manually
> applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for
> ZOOKEEPER-1163.
> I noticed that when I connected using zkCli.sh to the first node (90.0.0.221,
> zkid 84), I saw only one znode in a particular path:
> {quote}
> [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
> [nominee0000000011]
> [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee0000000011
> 90.0.0.222:7777
> cZxid = 0x400000027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x400000027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x400000027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> {quote}
> However, when I connected zkCli.sh to the second server (90.0.0.222, zkid 251),
> I saw three znodes under that same path:
> {quote}
> [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
> nominee0000000006 nominee0000000010 nominee0000000011
> [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee0000000011
> 90.0.0.222:7777
> cZxid = 0x400000027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x400000027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x400000027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee0000000010
> 90.0.0.221:7777
> cZxid = 0x30000014c
> ctime = Thu Jan 19 07:53:42 UTC 2012
> mZxid = 0x30000014c
> mtime = Thu Jan 19 07:53:42 UTC 2012
> pZxid = 0x30000014c
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220000
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee0000000006
> 90.0.0.223:7777
> cZxid = 0x200000cab
> ctime = Thu Jan 19 08:00:30 UTC 2012
> mZxid = 0x200000cab
> mtime = Thu Jan 19 08:00:30 UTC 2012
> pZxid = 0x200000cab
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x5434f5074e040002
> dataLength = 16
> numChildren = 0
> {quote}
> These never went away for the lifetime of the server, for any clients
> connected directly to that server. Note that this cluster is configured to
> have all three servers still, the third one being down (90.0.0.223, zkid 162).
> I captured the data/snapshot directories for the two live servers. When
> I start single-node servers using each directory, I can briefly see that the
> inconsistent data is present in those logs, though the ephemeral nodes seem
> to get (correctly) cleaned up pretty soon after I start the server.
> I will upload a tar containing the debug logs and data directories from the
> failure. I think we can reproduce it regularly if you need more info.
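A minimal sketch of the missing handling referenced in the comment above. This is illustrative only, not the actual patch (which had not been posted yet at the time of the comment): applySessionTxn is a hypothetical helper name, while zks, OpCode, CreateSessionTxn, and sessionTracker mirror the FinalRequestProcessor snippet quoted in the comment.

{code}
import org.apache.jute.Record;
import org.apache.zookeeper.ZooDefs.OpCode;
import org.apache.zookeeper.txn.CreateSessionTxn;
import org.apache.zookeeper.txn.TxnHeader;

// hypothetical server-side helper: applies a committed txn to the data
// tree AND keeps the session tracker in step, so that both
// FinalRequestProcessor and Learner.syncWithLeader() could share it
void applySessionTxn(TxnHeader hdr, Record txn) {
    // apply to the data tree first, exactly as both code paths already do
    zks.getZKDatabase().processTxn(hdr, txn);
    // then mirror the session bookkeeping from FinalRequestProcessor, so a
    // createSession replayed from a DIFF also lands in the session tracker
    if (hdr.getType() == OpCode.createSession
            && txn instanceof CreateSessionTxn) {
        zks.sessionTracker.addSession(hdr.getClientId(),
                ((CreateSessionTxn) txn).getTimeOut());
    } else if (hdr.getType() == OpCode.closeSession) {
        zks.sessionTracker.removeSession(hdr.getClientId());
    }
}
{code}

Routing both paths through one helper like this would keep the session map and the data tree consistent no matter which code path applies the transaction, which is exactly the invariant the diff-sync path was violating.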
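To see the divergence the report describes without zkCli, here is a small cross-check, assuming the stock ZooKeeper Java client. ConsistencyCheck is an illustrative name, and the addresses, port, and path are taken from the zkCli prompts in the report.

{code}
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;

public class ConsistencyCheck {
    public static void main(String[] args) throws Exception {
        String path = "/election/zkrsm";
        // the two live servers from the report
        for (String server : new String[] { "90.0.0.221:2888", "90.0.0.222:2888" }) {
            final CountDownLatch connected = new CountDownLatch(1);
            // connect to a single server so the read is served by that replica
            ZooKeeper zk = new ZooKeeper(server, 30000, new Watcher() {
                public void process(WatchedEvent event) {
                    if (event.getState() == KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await();
            // reads are served locally by the connected replica, so a
            // divergent replica will report a divergent child list
            List<String> children = zk.getChildren(path, false);
            Collections.sort(children);
            System.out.println(server + " -> " + children);
            zk.close();
        }
    }
}
{code}

On a healthy ensemble the two printed lists match; in the failure above, 90.0.0.221 would show one child and 90.0.0.222 three.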