[ https://issues.apache.org/jira/browse/ZOOKEEPER-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mahadev konar updated ZOOKEEPER-1319:
-------------------------------------

    Priority: Blocker  (was: Major)
    Fix Version/s: 3.4.1

> Missing data after restarting+expanding a cluster
> -------------------------------------------------
>
>                 Key: ZOOKEEPER-1319
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1319
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.0
>         Environment: Linux (Debian Squeeze)
>            Reporter: Jeremy Stribling
>            Priority: Blocker
>              Labels: cluster, data
>             Fix For: 3.4.1
>
>         Attachments: logs.tgz
>
>
> I've been trying to update to ZK 3.4.0 and have had some issues where some
> data become inaccessible after adding a node to a cluster.  My use case is a
> bit strange (as explained before on this list) in that I try to grow the
> cluster dynamically by having an external program automatically restart
> Zookeeper servers in a controlled way whenever the list of participating ZK
> servers needs to change.  This used to work just fine in 3.3.3 (and before),
> so this represents a regression.
> The scenario I see is this:
> 1) Start up a 1-server ZK cluster (the server has ZK ID 0).
> 2) A client connects to the server, and makes a bunch of znodes, in
> particular a znode called "/membership".
> 3) Shut down the cluster.
> 4) Bring up a 2-server ZK cluster, including the original server 0 with its
> existing data, and a new server with ZK ID 1.
> 5) Node 0 has the highest zxid and is elected leader.
> 6) A client connecting to server 1 tries to "get /membership" and gets back a
> -101 error code (no such znode).
> 7) The same client then tries to "create /membership" and gets back a -110
> error code (znode already exists).
> 8) Clients connecting to server 0 can successfully "get /membership".
> I will attach a tarball with debug logs for both servers, annotating where
> steps #1 and #4 happen.
> You can see that the election involves a proposal
> for zxid 110 from server 0, but immediately following the election server 1
> has these lines:
> 2011-12-05 17:18:48,308 9299 [QuorumPeer[myid=1]/127.0.0.1:2901] WARN org.apache.zookeeper.server.quorum.Learner - Got zxid 0x100000001 expected 0x1
> 2011-12-05 17:18:48,313 9304 [SyncThread:1] INFO org.apache.zookeeper.server.persistence.FileTxnLog - Creating new log file: log.100000001
> Perhaps that's not relevant, but it struck me as odd.  At the end of server
> 1's log you can see a repeated cycle of getData->create->getData as the
> client tries to make sense of the inconsistent responses.
> The other piece of information is that if I try to use the on-disk
> directories for either of the servers to start a new one-node ZK cluster, all
> the data are accessible.
> I haven't tried writing a program outside of my application to reproduce
> this, but I can do it very easily with some of my app's tests if anyone needs
> more information.
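
A note for readers of the Learner warning quoted above ("Got zxid 0x100000001 expected 0x1"): a ZooKeeper zxid is a 64-bit number whose high 32 bits carry the leader epoch and whose low 32 bits carry a counter within that epoch. The sketch below only decodes the two values from the log line; the helper name `split_zxid` is hypothetical (it is not part of any ZooKeeper API), and the decoding by itself does not establish whether the warning is related to the missing data.

```python
from typing import Tuple


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Split a 64-bit zxid into (epoch, counter).

    Hypothetical helper for illustration only: the high 32 bits are the
    leader epoch, the low 32 bits are the per-epoch transaction counter.
    """
    epoch = zxid >> 32
    counter = zxid & 0xFFFFFFFF
    return epoch, counter


# The two values taken from the WARN line in server 1's log.
for label, value in (("got", 0x100000001), ("expected", 0x1)):
    epoch, counter = split_zxid(value)
    print(f"{label}: zxid={value:#x} epoch={epoch} counter={counter}")
```

Read this way, the zxid that server 1 received is counter 1 in epoch 1, while the one it expected is counter 1 in epoch 0.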