Hi,

I don't intend to hijack Dr. Hao's email thread here, but I would like to point out two things:

1. I use an embedded server as well, but I don't use any setters. We extend QuorumPeerMain and call its initializeAndRun() function, so we are doing pretty much the same thing that QuorumPeerMain does. However, note that I am seeing the same problem (in ZK 3.3.0) that Dr. Hao is seeing. I haven't debugged the cause yet. I assumed that this was an error in my implementation (and it still could be). Nevertheless, this could turn out to be a bug as well.

2. With respect to Ted's point about backward compatibility, I would suggest providing an API that supports embedded ZK instead of asking users not to embed ZK.

-Vishal

On Thu, Aug 12, 2010 at 3:18 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> It doesn't.
>
> But running a ZK cluster that is incorrectly configured can cause this
> problem and configuring ZK using setters is likely to be subject to changes
> in what configuration is needed. Thus, your style of code is more subject
> to decay over time than is nice.
>
> The rest of my comments detail *other* reasons why embedding a coordination
> layer in the code being coordinated is a bad idea.
>
> On Thu, Aug 12, 2010 at 6:33 AM, Vishal K <vishalm...@gmail.com> wrote:
>
> > Hi Ted,
> >
> > Can you explain why running ZK in embedded mode can cause znode
> > inconsistencies?
> > Thanks.
> >
> > -Vishal
> >
> > On Thu, Aug 12, 2010 at 12:01 AM, Ted Dunning <ted.dunn...@gmail.com>
> > wrote:
> >
> > > Try running the server in non-embedded mode.
> > >
> > > Also, you are assuming that you know everything about how to configure the
> > > quorumPeer. That is going to change and your code will break at that time.
> > > If you use a non-embedded cluster, this won't be a problem and you will be
> > > able to upgrade the ZK version without having to restart your service.
> > >
> > > My own opinion is that running an embedded ZK is a serious architectural
> > > error.
> > > Since I don't know your particular situation, it might be different,
> > > but there is an inherent contradiction involved in running a coordination
> > > layer as part of the thing being coordinated. Whatever your software does,
> > > it isn't what ZK does. As such, it is better to factor out the ZK
> > > functionality and make it completely stable. That gives you a much simpler
> > > world and will make it easier for you to troubleshoot your system. The
> > > simple fact that you can't take down your service without affecting the
> > > reliability of your ZK layer makes this a very bad idea.
> > >
> > > The problems you are having now are only a preview of what this
> > > architectural error leads to. There will be more problems and many of them
> > > are likely to be more subtle and lead to service interruptions and lots of
> > > wasted time.
> > >
> > > On Wed, Aug 11, 2010 at 8:49 PM, Dr Hao He <h...@softtouchit.com> wrote:
> > >
> > > > hi, Ted and Mahadev,
> > > >
> > > > Here are some more details about my setup:
> > > >
> > > > I run zookeeper in the embedded mode with the following code:
> > > >
> > > > quorumPeer = new QuorumPeer();
> > > > quorumPeer.setClientPort(getClientPort());
> > > > quorumPeer.setTxnFactory(new FileTxnSnapLog(
> > > >         new File(getDataLogDir()), new File(getDataDir())));
> > > > quorumPeer.setQuorumPeers(getServers());
> > > > quorumPeer.setElectionType(getElectionAlg());
> > > > quorumPeer.setMyid(getServerId());
> > > > quorumPeer.setTickTime(getTickTime());
> > > > quorumPeer.setInitLimit(getInitLimit());
> > > > quorumPeer.setSyncLimit(getSyncLimit());
> > > > quorumPeer.setQuorumVerifier(getQuorumVerifier());
> > > > quorumPeer.setCnxnFactory(cnxnFactory);
> > > > quorumPeer.start();
> > > >
> > > > The configuration values are read from the following XML document for
> > > > server 1:
> > > >
> > > > <cluster tickTime="1000" initLimit="10" syncLimit="5" clientPort="2181"
> > > >          serverId="1">
> > > >   <member id="1" host="192.168.2.6:2888:3888"/>
> > > >   <member id="2" host="192.168.2.3:2888:3888"/>
> > > >   <member id="3" host="192.168.2.4:2888:3888"/>
> > > > </cluster>
> > > >
> > > > The other servers have the same configuration, except that their ids are
> > > > changed to 2 and 3.
> > > >
> > > > The error occurred on server 3 when I batch loaded some messages to
> > > > server 1. However, this error does not always happen. I am not sure
> > > > exactly what triggered this error yet.
> > > >
> > > > I also performed the "stat" operation on one of the "No exit" nodes and
> > > > got:
> > > >
> > > > stat /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg0000001583
> > > > Exception in thread "main" java.lang.NullPointerException
> > > >     at org.apache.zookeeper.ZooKeeperMain.printStat(ZooKeeperMain.java:129)
> > > >     at org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:715)
> > > >     at org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:579)
> > > >     at org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:351)
> > > >     at org.apache.zookeeper.ZooKeeperMain.run(ZooKeeperMain.java:309)
> > > >     at org.apache.zookeeper.ZooKeeperMain.main(ZooKeeperMain.java:268)
> > > > [...@t43 zookeeper-3.2.2]$ bin/zkCli.sh
> > > >
> > > > Those message nodes are created as CreateMode.PERSISTENT_SEQUENTIAL and
> > > > are deleted by the last server that has read them.
> > > >
> > > > If I remove the troubled server's zookeeper log directory and restart the
> > > > server, then everything is ok.
> > > >
> > > > I will try to get the nc result next time I see this problem.
> > > >
> > > > Dr Hao He
> > > >
> > > > XPE - the truly SOA platform
> > > >
> > > > h...@softtouchit.com
> > > > http://softtouchit.com
> > > > http://itunes.com/apps/Scanmobile
> > > >
> > > > On 12/08/2010, at 12:32 AM, Mahadev Konar wrote:
> > > >
> > > > > Hi Dr Hao,
> > > > > Can you please post the configuration of all the 3 zookeeper servers? I
> > > > > suspect it might be misconfigured clusters and they might not belong to
> > > > > the same ensemble.
> > > > >
> > > > > Just to be clear:
> > > > > /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg0000002807
> > > > >
> > > > > And other such nodes exist on one of the zookeeper servers and the same
> > > > > node does not exist on other servers?
> > > > >
> > > > > Also, as ted pointed out, can you please post the output of
> > > > > echo "stat" | nc localhost 2181 (on all the 3 servers) to the list?
> > > > >
> > > > > Thanks
> > > > > mahadev
> > > > >
> > > > > On 8/11/10 12:10 AM, "Dr Hao He" <h...@softtouchit.com> wrote:
> > > > >
> > > > >> hi, Ted,
> > > > >>
> > > > >> Thanks for the reply.
> > > > >> Here is what I did:
> > > > >>
> > > > >> [zk: localhost:2181(CONNECTED) 0] ls
> > > > >> /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg0000002948
> > > > >> []
> > > > >> [zk: localhost:2181(CONNECTED) 1] ls
> > > > >> /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs
> > > > >> [msg0000002807, msg0000002700, msg0000002701, msg0000002804, msg0000002704,
> > > > >> msg0000002706, msg0000002601, msg0000001849, msg0000001847, msg0000002508,
> > > > >> msg0000002609, msg0000001841, msg0000002607, msg0000002606, msg0000002604,
> > > > >> msg0000002809, msg0000002817, msg0000001633, msg0000002812, msg0000002814,
> > > > >> msg0000002711, msg0000002815, msg0000002713, msg0000002716, msg0000001772,
> > > > >> msg0000002811, msg0000001635, msg0000001774, msg0000002515, msg0000002610,
> > > > >> msg0000001838, msg0000002517, msg0000002612, msg0000002519, msg0000001973,
> > > > >> msg0000001835, msg0000001974, msg0000002619, msg0000001831, msg0000002510,
> > > > >> msg0000002512, msg0000002615, msg0000002614, msg0000002617, msg0000002104,
> > > > >> msg0000002106, msg0000001769, msg0000001768, msg0000002828, msg0000002822,
> > > > >> msg0000001760, msg0000002820, msg0000001963, msg0000001961, msg0000002110,
> > > > >> msg0000002118, msg0000002900, msg0000002836, msg0000001757, msg0000002907,
> > > > >> msg0000001753, msg0000001752, msg0000001755, msg0000001952, msg0000001958,
> > > > >> msg0000001852, msg0000001956, msg0000001854, msg0000002749, msg0000001608,
> > > > >> msg0000001609, msg0000002747, msg0000002882, msg0000001743, msg0000002888,
> > > > >> msg0000001605, msg0000002885, msg0000001487, msg0000001746, msg0000002330,
> > > > >> msg0000001749, msg0000001488, msg0000001489, msg0000001881, msg0000001491,
> > > > >> msg0000002890, msg0000001889, msg0000002758, msg0000002241, msg0000002892,
> > > > >> msg0000002852, msg0000002759, msg0000002898, msg0000002850, msg0000001733,
> > > > >> msg0000002751, msg0000001739, msg0000002753, msg0000002756, msg0000002332,
> > > > >> msg0000001872, msg0000002233, msg0000001721, msg0000001627, msg0000001720,
> > > > >> msg0000001625, msg0000001628, msg0000001629, msg0000001729, msg0000002350,
> > > > >> msg0000001727, msg0000002352, msg0000001622, msg0000001726, msg0000001623,
> > > > >> msg0000001723, msg0000001724, msg0000001621, msg0000002736, msg0000002738,
> > > > >> msg0000002363, msg0000001717, msg0000002878, msg0000002362, msg0000002361,
> > > > >> msg0000001611, msg0000001894, msg0000002357, msg0000002218, msg0000002358,
> > > > >> msg0000002355, msg0000001895, msg0000002356, msg0000001898, msg0000002354,
> > > > >> msg0000001996, msg0000001990, msg0000002093, msg0000002880, msg0000002576,
> > > > >> msg0000002579, msg0000002267, msg0000002266, msg0000002366, msg0000001901,
> > > > >> msg0000002365, msg0000001903, msg0000001799, msg0000001906, msg0000002368,
> > > > >> msg0000001597, msg0000002679, msg0000002166, msg0000001595, msg0000002481,
> > > > >> msg0000002482, msg0000002373, msg0000002374, msg0000002371, msg0000001599,
> > > > >> msg0000002773, msg0000002274, msg0000002275, msg0000002270, msg0000002583,
> > > > >> msg0000002271, msg0000002580, msg0000002067, msg0000002277, msg0000002278,
> > > > >> msg0000002376, msg0000002180, msg0000002467, msg0000002378, msg0000002182,
> > > > >> msg0000002377, msg0000002184, msg0000002379, msg0000002187, msg0000002186,
> > > > >> msg0000002665, msg0000002666, msg0000002381, msg0000002382, msg0000002661,
> > > > >> msg0000002662, msg0000002663, msg0000002385, msg0000002284, msg0000002766,
> > > > >> msg0000002282, msg0000002190, msg0000002599, msg0000002054, msg0000002596,
> > > > >> msg0000002453, msg0000002459, msg0000002457, msg0000002456, msg0000002191,
> > > > >> msg0000002652, msg0000002395, msg0000002650, msg0000002656, msg0000002655,
> > > > >> msg0000002189, msg0000002047, msg0000002658, msg0000002659, msg0000002796,
> > > > >> msg0000002250, msg0000002255, msg0000002589, msg0000002257, msg0000002061,
> > > > >> msg0000002064, msg0000002585, msg0000002258, msg0000002587, msg0000002444,
> > > > >> msg0000002446, msg0000002447, msg0000002450, msg0000002646, msg0000001501,
> > > > >> msg0000002591, msg0000002592, msg0000001503, msg0000001506, msg0000002260,
> > > > >> msg0000002594, msg0000002262, msg0000002263, msg0000002264, msg0000002590,
> > > > >> msg0000002132, msg0000002130, msg0000002530, msg0000002931, msg0000001559,
> > > > >> msg0000001808, msg0000002024, msg0000001553, msg0000002939, msg0000002937,
> > > > >> msg0000001556, msg0000002935, msg0000002933, msg0000002140, msg0000001937,
> > > > >> msg0000002143, msg0000002520, msg0000002522, msg0000002429, msg0000002524,
> > > > >> msg0000002920, msg0000002035, msg0000001561, msg0000002134, msg0000002138,
> > > > >> msg0000002925, msg0000002151, msg0000002287, msg0000002555, msg0000002010,
> > > > >> msg0000002002, msg0000002290, msg0000001537, msg0000002005, msg0000002147,
> > > > >> msg0000002145, msg0000002698, msg0000001592, msg0000001810, msg0000002690,
> > > > >> msg0000002691, msg0000001911, msg0000001910, msg0000002693, msg0000001812,
> > > > >> msg0000001817, msg0000001547, msg0000002012, msg0000002015, msg0000002941,
> > > > >> msg0000001688, msg0000002018, msg0000002684, msg0000002944, msg0000001540,
> > > > >> msg0000002686, msg0000001541, msg0000002946, msg0000002688, msg0000001584,
> > > > >> msg0000002948]
> > > > >>
> > > > >> [zk: localhost:2181(CONNECTED) 7] delete
> > > > >> /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg0000002948
> > > > >> Node does not exist:
> > > > >> /xpe/queues/3bd7851e79381ef4bfd1a5857b5e34c04e5159e5/msgs/msg0000002948
> > > > >>
> > > > >> When I performed the same operations on another node, none of those
> > > > >> nodes existed.
> > > > >>
> > > > >> Dr Hao He
> > > > >>
> > > > >> XPE - the truly SOA platform
> > > > >>
> > > > >> h...@softtouchit.com
> > > > >> http://softtouchit.com
> > > > >> http://itunes.com/apps/Scanmobile
> > > > >>
> > > > >> On 11/08/2010, at 4:38 PM, Ted Dunning wrote:
> > > > >>
> > > > >>> Can you provide some more information? The output of some of the four
> > > > >>> letter commands and a transcript of what you are doing would be very
> > > > >>> helpful.
> > > > >>>
> > > > >>> Also, there is no way for znodes to exist on one node of a properly
> > > > >>> operating ZK cluster and not on either of the other two. Something has to
> > > > >>> be wrong and I would vote for operator error (not to cast aspersions, it is
> > > > >>> just that humans like you and *me* make more errors than ZK does).
> > > > >>>
> > > > >>> On Tue, Aug 10, 2010 at 11:32 PM, Dr Hao He <h...@softtouchit.com> wrote:
> > > > >>>
> > > > >>>> hi, All,
> > > > >>>>
> > > > >>>> I have a 3-host cluster running ZooKeeper 3.2.2. On one of the hosts,
> > > > >>>> there are a number of nodes that I can "get" and "ls" using zkCli.sh.
> > > > >>>> However, when I tried to "delete" any of them, I got "Node does not exist"
> > > > >>>> error. Those nodes do not exist on the other two hosts.
> > > > >>>>
> > > > >>>> Any idea how we should handle this type of errors and what might have
> > > > >>>> caused this problem?
> > > > >>>>
> > > > >>>> Dr Hao He
> > > > >>>>
> > > > >>>> XPE - the truly SOA platform
> > > > >>>>
> > > > >>>> h...@softtouchit.com
> > > > >>>> http://softtouchit.com
> > > > >>>> http://itunes.com/apps/Scanmobile
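
For comparison with the setter-based embedded setup quoted in this thread, here is a rough sketch of what the equivalent standalone (non-embedded) configuration for server 1 might look like as a zoo.cfg file. The timing values, client port, and member addresses are copied from Dr Hao's XML cluster document. The two directory paths are placeholders invented for illustration, since the thread never states what getDataDir() and getDataLogDir() return.

```properties
# Sketch of zoo.cfg for server 1, derived from the XML cluster document in this thread.
tickTime=1000
initLimit=10
syncLimit=5
clientPort=2181

# ASSUMPTION: placeholder paths; the real locations are not given in the thread.
dataDir=/var/lib/zookeeper/data
dataLogDir=/var/lib/zookeeper/log

# Ensemble members, host:quorumPort:electionPort, as listed in the XML.
server.1=192.168.2.6:2888:3888
server.2=192.168.2.3:2888:3888
server.3=192.168.2.4:2888:3888
```

With a file-based setup, the server id normally comes from a file named myid in dataDir (containing 1, 2, or 3) rather than from a setMyid() call, so servers 2 and 3 could share this file and differ only in their myid.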