[ https://issues.apache.org/jira/browse/ZOOKEEPER-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191650#comment-13191650 ]
Jeremy Stribling commented on ZOOKEEPER-1367:
---------------------------------------------

bq. Hm..... very interesting. What exactly does this mean? You mentioned earlier that you "embed Zookeeper into our application framework and set up things through code" how exactly are you performing this "restart". Is ZK a separate process, are you killing processes, or are you calling some code to effect this? I ask because we really don't support this and I'm wondering if that could be related.

ZK is embedded into a Java process, running alongside some other Java apps we need to manage. The reason for this is that dynamic cluster membership is absolutely required for our application; we cannot know the IPs/ports/server IDs of all of the ZooKeeper servers that will exist in the cluster. So as new nodes come online, they connect to a centralized part of our service, and we distribute the new list of servers to all the existing servers so they can restart themselves. By "restart" here, I mean we call QuorumPeer.shutdown (and FastLeaderElection.shutdown), delete the previous QuorumPeer, and construct a new one with the new configuration. This is the same way we ran things under 3.3.3. I understand that this is not officially supported, but in my heart of hearts I don't believe it is related to the bug at hand, so I appreciate your indulgence on the matter. I've put a rough sketch of what this restart looks like at the end of this comment.

bq. that's true, but the more variables we can eliminate the more easy it will be to track the real issue down.

We are supposed to be running with synced clocks, but QA is trying to track down a bug with their system right now to figure out why NTP isn't working in their environment. Sorry for the extra level of confusion.

bq. btw, if you do have QA retest this please do capture all the logs (log4j). Unfortunately the two logs don't both show the znode expiring (I see the time in question in one but not the other log), that would give much more insight into what happened...

I don't quite follow your question. Do you mean that my logs aren't capturing all of the output logged by ZooKeeper? Or are you asking for the logs from the previous run of the system, when the znodes were originally created? I will definitely try to get the latter, but as for the former problem -- this is everything that ZooKeeper logged, so I'm not sure what else I can capture during the run. I think the very problem is that one of the nodes didn't expire one of the sessions, so I wouldn't expect that to be in the log for that server. But maybe I don't quite understand what you're asking.
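To make the "restart" concrete, here is a rough, simplified sketch of what I mean. This is not our actual code: the class name EmbeddedQuorum, the method restartWithNewMembership, and the hard-coded timing values are made up for illustration, and the QuorumPeer wiring just follows what QuorumPeerMain.runFromConfig does in 3.4.

{code}
// Rough illustration only -- not our production code.
import java.io.File;
import java.net.InetSocketAddress;
import java.util.Map;

import org.apache.zookeeper.server.ServerCnxnFactory;
import org.apache.zookeeper.server.ZKDatabase;
import org.apache.zookeeper.server.persistence.FileTxnSnapLog;
import org.apache.zookeeper.server.quorum.QuorumPeer;
import org.apache.zookeeper.server.quorum.QuorumPeer.QuorumServer;
import org.apache.zookeeper.server.quorum.flexible.QuorumMaj;

public class EmbeddedQuorum {
    private QuorumPeer quorumPeer;

    // Called whenever the centralized part of our service pushes a new server list.
    public synchronized void restartWithNewMembership(
            long myId,
            Map<Long, QuorumServer> servers,     // new view: server id -> addresses
            InetSocketAddress clientAddr,
            File dataDir, File dataLogDir) throws Exception {

        // Tear down the previous peer, including its FastLeaderElection.
        if (quorumPeer != null) {
            quorumPeer.shutdown();
            if (quorumPeer.getElectionAlg() != null) {
                quorumPeer.getElectionAlg().shutdown();
            }
            quorumPeer.join();                   // QuorumPeer is a Thread
            quorumPeer = null;
        }

        // Construct a brand-new peer from the updated configuration,
        // roughly the way QuorumPeerMain.runFromConfig wires one up.
        ServerCnxnFactory cnxnFactory = ServerCnxnFactory.createFactory();
        cnxnFactory.configure(clientAddr, 60 /* maxClientCnxns */);

        QuorumPeer peer = new QuorumPeer();
        peer.setTxnFactory(new FileTxnSnapLog(dataLogDir, dataDir));
        peer.setQuorumPeers(servers);
        peer.setElectionType(3);                 // FastLeaderElection
        peer.setMyid(myId);
        peer.setTickTime(2000);
        peer.setInitLimit(10);
        peer.setSyncLimit(5);
        peer.setQuorumVerifier(new QuorumMaj(servers.size()));
        peer.setClientPortAddress(clientAddr);
        peer.setCnxnFactory(cnxnFactory);
        peer.setZKDatabase(new ZKDatabase(peer.getTxnFactory()));
        peer.start();

        quorumPeer = peer;
    }
}
{code}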
> Data inconsistencies and unexpired ephemeral nodes after cluster restart
> ------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1367
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1367
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.2
>         Environment: Debian Squeeze, 64-bit
>            Reporter: Jeremy Stribling
>            Priority: Blocker
>             Fix For: 3.4.3
>
>         Attachments: ZOOKEEPER-1367.tgz
>
>
> In one of our tests, we have a cluster of three ZooKeeper servers. We kill
> all three, and then restart just two of them. Sometimes we notice that on
> one of the restarted servers, ephemeral nodes from previous sessions do not
> get deleted, while on the other server they do. We are effectively running
> 3.4.2, though technically we are running 3.4.1 with the patch manually
> applied for ZOOKEEPER-1333 and a C client for 3.4.1 with the patches for
> ZOOKEEPER-1163.
> I noticed that when I connected using zkCli.sh to the first node (90.0.0.221,
> zkid 84), I saw only one znode in a particular path:
> {quote}
> [zk: 90.0.0.221:2888(CONNECTED) 0] ls /election/zkrsm
> [nominee0000000011]
> [zk: 90.0.0.221:2888(CONNECTED) 1] get /election/zkrsm/nominee0000000011
> 90.0.0.222:7777
> cZxid = 0x400000027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x400000027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x400000027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> {quote}
> However, when I connect zkCli.sh to the second server (90.0.0.222, zkid 251),
> I saw three znodes under that same path:
> {quote}
> [zk: 90.0.0.222:2888(CONNECTED) 2] ls /election/zkrsm
> nominee0000000006 nominee0000000010 nominee0000000011
> [zk: 90.0.0.222:2888(CONNECTED) 2] get /election/zkrsm/nominee0000000011
> 90.0.0.222:7777
> cZxid = 0x400000027
> ctime = Thu Jan 19 08:18:24 UTC 2012
> mZxid = 0x400000027
> mtime = Thu Jan 19 08:18:24 UTC 2012
> pZxid = 0x400000027
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220001
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 3] get /election/zkrsm/nominee0000000010
> 90.0.0.221:7777
> cZxid = 0x30000014c
> ctime = Thu Jan 19 07:53:42 UTC 2012
> mZxid = 0x30000014c
> mtime = Thu Jan 19 07:53:42 UTC 2012
> pZxid = 0x30000014c
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0xa234f4f3bc220000
> dataLength = 16
> numChildren = 0
> [zk: 90.0.0.222:2888(CONNECTED) 4] get /election/zkrsm/nominee0000000006
> 90.0.0.223:7777
> cZxid = 0x200000cab
> ctime = Thu Jan 19 08:00:30 UTC 2012
> mZxid = 0x200000cab
> mtime = Thu Jan 19 08:00:30 UTC 2012
> pZxid = 0x200000cab
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x5434f5074e040002
> dataLength = 16
> numChildren = 0
> {quote}
> These never went away for the lifetime of the server, for any clients
> connected directly to that server. Note that this cluster is configured to
> have all three servers still, the third one being down (90.0.0.223, zkid 162).
> I captured the data/snapshot directories for the two live servers. When
> I start single-node servers using each directory, I can briefly see that the
> inconsistent data is present in those logs, though the ephemeral nodes seem
> to get (correctly) cleaned up pretty soon after I start the server.
> I will upload a tar containing the debug logs and data directories from the
> failure. I think we can reproduce it regularly if you need more info.
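For anyone trying to reproduce this, a minimal client-side check for the divergence described above might look like the sketch below. It is illustrative only: the class name CompareZnodes is made up, and the host:port strings are simply the ones that appear in the zkCli.sh output above. It connects to each server individually and compares the children of /election/zkrsm.

{code}
// Minimal sketch (not part of the attached test): list the children of the
// same path on two servers and report whether they agree.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class CompareZnodes {
    static List<String> childrenOf(String server, String path) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper(server, 30000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();   // session established
                }
            }
        });
        try {
            connected.await();               // wait for the session to come up
            List<String> children = new ArrayList<String>(zk.getChildren(path, false));
            Collections.sort(children);
            return children;
        } finally {
            zk.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String path = "/election/zkrsm";
        List<String> a = childrenOf("90.0.0.221:2888", path);
        List<String> b = childrenOf("90.0.0.222:2888", path);
        System.out.println("90.0.0.221 sees: " + a);
        System.out.println("90.0.0.222 sees: " + b);
        System.out.println(a.equals(b) ? "consistent" : "INCONSISTENT");
    }
}
{code}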