What specific log files should I look for? I inspected the config files for all 3 nodes and they *are different*. Specifically, the servers specified are not consistent:

$ cat /data/zookeeper/10.10.5.56/10.10.5.56_2181.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/10.10.5.56/
maxClientCnxns=1000
clientPortAddress=10.10.5.56
clientPort=2181
server.1=10.10.5.46:2182:2183
server.2=10.10.5.35:2182:2183
server.3=10.10.5.56:2182:2183

$ cat /data/zookeeper/10.10.5.58/10.10.5.58_2181.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/10.10.5.58/
maxClientCnxns=1000
clientPortAddress=10.10.5.58
clientPort=2181
server.1=10.10.5.46:2182:2183
server.2=10.10.5.56:2182:2183
server.3=10.10.5.58:2182:2183

$ cat /data/zookeeper/10.10.5.46/10.10.5.46_2181.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/10.10.5.46/
maxClientCnxns=1000
clientPortAddress=10.10.5.46
clientPort=2181
server.1=10.10.5.46:2182:2183
server.2=10.10.5.35:2182:2183
server.3=10.10.5.56:2182:2183

So this looks like a configuration problem, not a ZooKeeper bug, correct?

On Tue, Dec 20, 2011 at 11:17 AM, Patrick Hunt <ph...@apache.org> wrote:

> Really the logs are critical here. If you can provide them it would shed
> light.
>
> Patrick
>
> On Tue, Dec 20, 2011 at 10:13 AM, Benjamin Reed <br...@apache.org> wrote:
> > i've seen it before when the configuration files haven't been setup
> > properly. i would check the configuration. if the leader is still the
> > leader, it must have active followers connected to it, otherwise it
> > would give up leadership. i would use netstat to find out who they
> > are.
> >
> > ben
> >
> > On Tue, Dec 20, 2011 at 9:00 AM, Marshall McMullen
> > <marshall.mcmul...@gmail.com> wrote:
> >> Zookeeper devs,
> >>
> >> I've got a cluster with 3 servers in the ensemble all running 3.4.0.
> >> After a few days of successful operation, we observed all zookeeper
> >> reads and writes began failing every time. In our log files, the error
> >> being reported is INVALID_STATE. I then telnetted to port 2181 on all
> >> three servers and was surprised to see that *two* of these servers both
> >> report they are the leader! Two of the nodes are in agreement on the
> >> Zxid, and one of the nodes is way out of whack with a much much larger
> >> Zxid. The node that all writes are flowing through is the one with the
> >> much higher Zxid.
> >>
> >> Has anyone ever seen this before? What can I do to diagnose this problem
> >> and resolve it? I was considering killing zookeeper on the node that
> >> should not be the leader (the one with the wrong Zxid) and removing the
> >> zookeeper data directory, then restarting zookeeper on that node. Any
> >> other ideas?
> >>
> >> I appreciate any help.
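
P.S. The comparison of the server.N lines in the three config files can also be done mechanically. Here is a minimal Python sketch of that check. It is only an illustration, not a ZooKeeper tool: the function names (server_entries, find_mismatches) are made up for this example, and the CONFIGS dictionary contains just the server.N lines copied from the three files shown in this message, keyed by the node they were read from.

```python
# Hypothetical helper for illustration only; not part of ZooKeeper.
# Compares the server.N entries across several config files.

# Only the server.N lines from the three configs in this message.
CONFIGS = {
    "10.10.5.56": """
server.1=10.10.5.46:2182:2183
server.2=10.10.5.35:2182:2183
server.3=10.10.5.56:2182:2183
""",
    "10.10.5.58": """
server.1=10.10.5.46:2182:2183
server.2=10.10.5.56:2182:2183
server.3=10.10.5.58:2182:2183
""",
    "10.10.5.46": """
server.1=10.10.5.46:2182:2183
server.2=10.10.5.35:2182:2183
server.3=10.10.5.56:2182:2183
""",
}


def server_entries(cfg_text):
    """Return a dict of the server.N=... entries found in one config."""
    entries = {}
    for line in cfg_text.splitlines():
        line = line.strip()
        if line.startswith("server.") and "=" in line:
            key, value = line.split("=", 1)
            entries[key.strip()] = value.strip()
    return entries


def find_mismatches(configs):
    """Return {server key: {node: value}} for keys the nodes disagree on."""
    parsed = {node: server_entries(text) for node, text in configs.items()}
    all_keys = sorted(set().union(*(p.keys() for p in parsed.values())))
    mismatches = {}
    for key in all_keys:
        values = {node: p.get(key) for node, p in parsed.items()}
        if len(set(values.values())) > 1:
            mismatches[key] = values
    return mismatches


if __name__ == "__main__":
    for key, values in find_mismatches(CONFIGS).items():
        print(key)
        for node, value in sorted(values.items()):
            print("  config on", node, "->", value)
```

With the three files in this message it flags server.2 and server.3: the config on 10.10.5.58 disagrees with the other two on both entries, while server.1 is the same everywhere.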