yes this is a configuration problem. 10.10.5.35 must be running as well, right?
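
a quick way to verify, assuming nc is available and 2181 is that box's client port, is the four letter words:

  $ echo ruok | nc 10.10.5.35 2181
  $ echo stat | nc 10.10.5.35 2181 | grep Mode

if it answers imok and reports a Mode: line (leader or follower) it's up and participating in a quorum.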
ben

On Tue, Dec 20, 2011 at 11:21 AM, Marshall McMullen
<marshall.mcmul...@gmail.com> wrote:
> What specific log files should I look for?
>
> I inspected the config files for all 3 nodes and they *are different*.
> Specifically, the servers specified are not consistent:
>
> $ cat /data/zookeeper/10.10.5.56/10.10.5.56_2181.cfg
> tickTime=2000
> initLimit=10
> syncLimit=5
> dataDir=/data/zookeeper/10.10.5.56/
> maxClientCnxns=1000
> clientPortAddress=10.10.5.56
> clientPort=2181
> server.1=10.10.5.46:2182:2183
> server.2=10.10.5.35:2182:2183
> server.3=10.10.5.56:2182:2183
>
> $ cat /data/zookeeper/10.10.5.58/10.10.5.58_2181.cfg
> tickTime=2000
> initLimit=10
> syncLimit=5
> dataDir=/data/zookeeper/10.10.5.58/
> maxClientCnxns=1000
> clientPortAddress=10.10.5.58
> clientPort=2181
> server.1=10.10.5.46:2182:2183
> server.2=10.10.5.56:2182:2183
> server.3=10.10.5.58:2182:2183
>
> $ cat /data/zookeeper/10.10.5.46/10.10.5.46_2181.cfg
> tickTime=2000
> initLimit=10
> syncLimit=5
> dataDir=/data/zookeeper/10.10.5.46/
> maxClientCnxns=1000
> clientPortAddress=10.10.5.46
> clientPort=2181
> server.1=10.10.5.46:2182:2183
> server.2=10.10.5.35:2182:2183
> server.3=10.10.5.56:2182:2183
>
> So this looks like a configuration problem, not a zookeeper bug, correct?
>
>
> On Tue, Dec 20, 2011 at 11:17 AM, Patrick Hunt <ph...@apache.org> wrote:
>
>> Really the logs are critical here. If you can provide them it would shed
>> light.
>>
>> Patrick
>>
>> On Tue, Dec 20, 2011 at 10:13 AM, Benjamin Reed <br...@apache.org> wrote:
>> > i've seen it before when the configuration files haven't been setup
>> > properly. i would check the configuration. if the leader is still the
>> > leader, it must have active followers connected to it, otherwise it
>> > would give up leadership. i would use netstat to find out who they
>> > are.
>> >
>> > ben
>> >
>> > On Tue, Dec 20, 2011 at 9:00 AM, Marshall McMullen
>> > <marshall.mcmul...@gmail.com> wrote:
>> >> Zookeeper devs,
>> >>
>> >> I've got a cluster with 3 servers in the ensemble all running 3.4.0. After
>> >> a few days of successful operation, we observed all zookeeper reads and
>> >> writes began failing every time. In our log files, the error being reported
>> >> is INVALID_STATE. I then telnetted to port 2181 on all three servers and
>> >> was surprised to see that *two* of these servers both report they are the
>> >> leader! Two of the nodes are in agreement on the Zxid, and one of the nodes
>> >> is way out of whack with a much much larger Zxid. The node that all writes
>> >> are flowing through is the one with the much higher Zxid.
>> >>
>> >> Has anyone ever seen this before? What can I do to diagnose this problem
>> >> and resolve it? I was considering killing zookeeper on the node that should
>> >> not be the leader (the one with the wrong Zxid) and removing the zookeeper
>> >> data directory, then restarting zookeeper on that node. Any other ideas?
>> >>
>> >> I appreciate any help.
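
one more note: whichever three hosts you settle on, the server.N list needs to be identical in every node's cfg, and each node's myid file (in its dataDir) has to match its own server.N entry. a rough sketch, with placeholder hosts since i don't know which three you intend:

  # identical in all three cfg files
  server.1=<hostA>:2182:2183
  server.2=<hostB>:2182:2183
  server.3=<hostC>:2182:2183

  # on hostA only (2 and 3 on the others)
  $ echo 1 > <dataDir>/myid

and to see which followers are actually connected to a leader, netstat on the quorum port should show the established connections, something like:

  $ netstat -tn | grep 2182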