[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193507#comment-13193507 ]
Jeremy Stribling commented on ZOOKEEPER-1367:
---------------------------------------------

QA was able to reproduce it. The good news is that NTP is working properly, so the logs should be a bit saner; also, I was able to salvage all the logs for the node that has the incorrectly-unexpired session. The bad news is that I don't have any usable logs for one of the nodes, and for the other I have only the most recent logs. I hope this helps at least a little.

This time around, all three nodes are live when the problem happens. There are three nodes: 90.0.0.1/208/node1, 90.0.0.2/30/node2, 90.0.0.3/119/node3. node2 is the one in the bad state:

{noformat}
[zk: 90.0.0.2:2888(CONNECTED) 0] ls /election/zkrsm
[nominee0000000034, nominee0000000035, nominee0000000032, nominee0000000036]
[zk: 90.0.0.2:2888(CONNECTED) 1] get /election/zkrsm/nominee0000000032
90.0.0.1:7777
cZxid = 0xe00000017
ctime = Wed Jan 25 20:53:15 UTC 2012
mZxid = 0xe00000017
mtime = Wed Jan 25 20:53:15 UTC 2012
pZxid = 0xe00000017
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x1e3516a4bb770000
dataLength = 14
numChildren = 0
{noformat}

/election/zkrsm/nominee0000000032 is owned by a session (0x1e3516a4bb770000) that should be expired. Connecting to the other two nodes shows only the latest three znodes.

I have uploaded a tar file containing the datadirs of all three nodes, the logs for all runs of node2, and the logs of the latest run of node3, to: http://pdos.lcs.mit.edu/~strib/ZOOKEEPER-1367-2.tgz (80 MB zipped, 2.8 GB unzipped)

Here's how the logs break down:

* run0.tgz: You can ignore this one; it's from a preliminary run that was not part of the cluster.
* run1.tgz: This is where the cluster is built up using restarts. In node2's case, it is first started with two nodes (node1 and node2). Then that QuorumPeer is shut down and restarted with a configuration of all three nodes.
* run[2-8].tgz: Each represents a cluster restart. No cluster re-configurations are done during this time.
Our whole application process is brought down and then back up using the same ZK configuration.
* run7.tgz: This is the run where the bad session was created and live:

{noformat}
2012-01-25 20:53:09,948 505755 [ProcessThread(sid:119 cport:-1):] TRACE org.apache.zookeeper.server.PrepRequestProcessor - :Psessionid:0x1e3516a4bb770000 type:createSession cxid:0x0 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
{noformat}

* run8.tgz: The session created in the previous run of the system lives on, and is never expired as far as I can tell:

{noformat}
2012-01-25 20:53:27,950 2929 [pool-1-thread-2] TRACE org.apache.zookeeper.server.persistence.FileTxnSnapLog - playLog --- create session in log: 1e3516a4bb770000 with timeout: 6000
{noformat}

I have the matching run8.log for node3, but it does not mention the bad session at all. I have no usable logs for node1.

I also managed to do a dump against each node while the system was live. Here are the results:

{noformat}
# telnet 90.0.0.1 2888
Trying 90.0.0.1...
Connected to 90.0.0.1.
Escape character is '^]'.
dump
SessionTracker dump:
Session Sets (3):
0 expire at Thu Jan 26 00:09:30 UTC 2012:
1 expire at Thu Jan 26 00:09:33 UTC 2012:
    0x773516a5076a0000
2 expire at Thu Jan 26 00:09:36 UTC 2012:
    0xd03516a4f7960000
    0x1e3516a4f8520001
ephemeral nodes dump:
Sessions with Ephemerals (3):
0x773516a5076a0000:
    /election/zk_live_members/nominee0000000035
    /election/role-cluster-mgmt_cluster/nominee0000000035
    /election/cassandra/nominee0000000036
    /election/zkrsm/nominee0000000034
0xd03516a4f7960000:
    /election/role-cluster-api_provider/nominee0000000033
    /election/connection-cluster/nominee0000000033
    /election/zkrsm/nominee0000000036
    /election/zk_live_members/nominee0000000036
    /election/role-cluster-dht_node/nominee0000000034
    /election/role-cluster-mgmt_cluster/nominee0000000036
    /election/role-cluster-switch_manager/nominee0000000033
    /election/DHT/nominee0000000033
    /election/dht-reconciliation/nominee0000000034
    /election/cassandra/nominee0000000034
    /election/role-cluster-persistence_server/nominee0000000034
    /election/role-cluster-logical_manager/nominee0000000033
0x1e3516a4f8520001:
    /election/zkrsm/nominee0000000035
    /election/zk_live_members/nominee0000000034
    /election/role-cluster-mgmt_cluster/nominee0000000034
    /election/cassandra/nominee0000000035
# telnet 90.0.0.2 2888
Trying 90.0.0.2...
Connected to 90.0.0.2.
Escape character is '^]'.
dump
SessionTracker dump:
org.apache.zookeeper.server.quorum.LearnerSessionTracker@1479a6b
ephemeral nodes dump:
Sessions with Ephemerals (4):
0x1e3516a4bb770000:
    /election/cassandra/nominee0000000032
    /election/role-cluster-mgmt_cluster/nominee0000000032
    /election/role-cluster-persistence_server/nominee0000000032
    /election/role-cluster-dht_node/nominee0000000032
    /election/dht-reconciliation/nominee0000000032
    /election/zkrsm/nominee0000000032
    /election/zk_live_members/nominee0000000032
0x773516a5076a0000:
    /election/zk_live_members/nominee0000000035
    /election/role-cluster-mgmt_cluster/nominee0000000035
    /election/cassandra/nominee0000000036
    /election/zkrsm/nominee0000000034
0xd03516a4f7960000:
    /election/role-cluster-api_provider/nominee0000000033
    /election/connection-cluster/nominee0000000033
    /election/zkrsm/nominee0000000036
    /election/zk_live_members/nominee0000000036
    /election/role-cluster-dht_node/nominee0000000034
    /election/role-cluster-mgmt_cluster/nominee0000000036
    /election/role-cluster-switch_manager/nominee0000000033
    /election/DHT/nominee0000000033
    /election/dht-reconciliation/nominee0000000034
    /election/cassandra/nominee0000000034
    /election/role-cluster-persistence_server/nominee0000000034
    /election/role-cluster-logical_manager/nominee0000000033
0x1e3516a4f8520001:
    /election/zkrsm/nominee0000000035
    /election/zk_live_members/nominee0000000034
    /election/role-cluster-mgmt_cluster/nominee0000000034
    /election/cassandra/nominee0000000035
# telnet 90.0.0.3 2888
Trying 90.0.0.3...
Connected to 90.0.0.3.
Escape character is '^]'.
dump
SessionTracker dump:
org.apache.zookeeper.server.quorum.LearnerSessionTracker@5960a2f1
ephemeral nodes dump:
Sessions with Ephemerals (3):
0x773516a5076a0000:
    /election/zk_live_members/nominee0000000035
    /election/role-cluster-mgmt_cluster/nominee0000000035
    /election/cassandra/nominee0000000036
    /election/zkrsm/nominee0000000034
0xd03516a4f7960000:
    /election/role-cluster-api_provider/nominee0000000033
    /election/connection-cluster/nominee0000000033
    /election/zkrsm/nominee0000000036
    /election/zk_live_members/nominee0000000036
    /election/role-cluster-dht_node/nominee0000000034
    /election/role-cluster-mgmt_cluster/nominee0000000036
    /election/role-cluster-switch_manager/nominee0000000033
    /election/DHT/nominee0000000033
    /election/dht-reconciliation/nominee0000000034
    /election/cassandra/nominee0000000034
    /election/role-cluster-persistence_server/nominee0000000034
    /election/role-cluster-logical_manager/nominee0000000033
0x1e3516a4f8520001:
    /election/zkrsm/nominee0000000035
    /election/zk_live_members/nominee0000000034
    /election/role-cluster-mgmt_cluster/nominee0000000034
    /election/cassandra/nominee0000000035
{noformat}

The system is still in that state right now; if anyone needs more info from the live system, please ask me for it in the next hour or so.

> Data inconsistencies and unexpired ephemeral nodes after cluster restart
> ------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1367
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.2
>         Environment: Debian Squeeze, 64-bit
>            Reporter: Jeremy Stribling
>            Priority: Blocker
>             Fix For: 3.4.3
>
>         Attachments: ZOOKEEPER-1367.tgz
>
>
> In one of our tests, we have a cluster of three ZooKeeper servers. We kill
> all three, and then restart just two of them.
> Sometimes we notice that on one of the restarted servers, ephemeral nodes
> from previous sessions do not get deleted, while on the other server they
> do. We are effectively running 3.4.2, though technically we are running
> 3.4.1 with the patch manually applied for ZOOKEEPER-1333 and a C client for
> 3.4.1 with the patches for ZOOKEEPER-1163.
> I noticed that when I connected using zkCli.sh to the first node
> (90.0.0.221, zkid 84), I saw only one znode in a particular path:
> {quote}
> [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
> [nominee0000000011]
> [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee0000000011
> 90.0.0.222:7777
> cZxid = 0x400000027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x400000027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x400000027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> {quote}
> However, when I connect zkCli.sh to the second server (90.0.0.222, zkid
> 251), I saw three znodes under that same path:
> {quote}
> [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
> nominee0000000006 nominee0000000010 nominee0000000011
> [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee0000000011
> 90.0.0.222:7777
> cZxid = 0x400000027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x400000027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x400000027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee0000000010
> 90.0.0.221:7777
> cZxid = 0x30000014c
> ctime = Thu Jan 19 07:53:42 UTC 2012
> mZxid = 0x30000014c
> mtime = Thu Jan 19 07:53:42 UTC 2012
> pZxid = 0x30000014c
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220000
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee0000000006
> 90.0.0.223:7777
> cZxid = 0x200000cab
> ctime = Thu Jan 19 08:00:30 UTC 2012
> mZxid = 0x200000cab
> mtime = Thu Jan 19 08:00:30 UTC 2012
> pZxid = 0x200000cab
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x5434f5074e040002
> dataLength = 16
> numChildren = 0
> {quote}
> These never went away for the lifetime of the server, for any clients
> connected directly to that server. Note that this cluster is configured to
> have all three servers still, the third one being down (90.0.0.223, zkid
> 162).
> I captured the data/snapshot directories for the two live servers. When I
> start single-node servers using each directory, I can briefly see that the
> inconsistent data is present in those logs, though the ephemeral nodes seem
> to get (correctly) cleaned up pretty soon after I start the server.
> I will upload a tar containing the debug logs and data directories from the
> failure. I think we can reproduce it regularly if you need more info.
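A footnote that may help whoever digs through the dumps: if I understand ZooKeeper 3.4.x correctly (SessionTrackerImpl.initializeNextSession()), the high byte of a 64-bit session id is the id of the server that created the session, so the session ids in the dumps can be traced back to a server. The helper below is my own sketch, not anything in the ZooKeeper API; the server ids 208/30/119 are the ones from the cluster described in the comment.

```python
# Sketch only: decode which server minted a ZooKeeper session id,
# assuming the 3.4.x scheme where SessionTrackerImpl places the
# server id in the most significant byte of the 64-bit session id.

def session_server_id(session_id: int) -> int:
    """Return the server id stored in the top byte of a session id."""
    return (session_id >> 56) & 0xFF

# Session ids taken from the dumps above:
for sid in (0x1e3516a4bb770000,   # -> 30  (node2): the unexpired session
            0x773516a5076a0000,   # -> 119 (node3)
            0xd03516a4f7960000,   # -> 208 (node1)
            0x1e3516a4f8520001):  # -> 30  (node2)
    print(hex(sid), "->", session_server_id(sid))
```

Notably, the stuck session 0x1e3516a4bb770000 decodes to server id 30, i.e. node2 itself, the node holding the bad state.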