Agree with Pat. We should dig into this ASAP. Marshall, mind opening a JIRA and posting the logs to it?

thanks
mahadev

On Tue, Dec 20, 2011 at 10:17 AM, Patrick Hunt <ph...@apache.org> wrote:
> Really the logs are critical here. If you can provide them it would shed
> light.
>
> Patrick
>
> On Tue, Dec 20, 2011 at 10:13 AM, Benjamin Reed <br...@apache.org> wrote:
> > i've seen it before when the configuration files haven't been setup
> > properly. i would check the configuration. if the leader is still the
> > leader, it must have active followers connected to it, otherwise it
> > would give up leadership. i would use netstat to find out who they
> > are.
> >
> > ben
> >
> > On Tue, Dec 20, 2011 at 9:00 AM, Marshall McMullen
> > <marshall.mcmul...@gmail.com> wrote:
> >> Zookeeper devs,
> >>
> >> I've got a cluster with 3 servers in the ensemble all running 3.4.0. After
> >> a few days of successful operation, we observed all zookeeper reads and
> >> writes began failing every time. In our log files, the error being reported
> >> is INVALID_STATE. I then telnetted to port 2181 on all three servers and
> >> was surprised to see that *two* of these servers both report they are the
> >> leader! Two of the nodes are in agreement on the Zxid, and one of the nodes
> >> is way out of whack with a much much larger Zxid. The node that all writes
> >> are flowing through is the one with the much higher Zxid.
> >>
> >> Has anyone ever seen this before? What can I do to diagnose this problem
> >> and resolve it? I was considering killing zookeeper on the node that should
> >> not be the leader (the one with the wrong Zxid) and removing the zookeeper
> >> data directory, then restarting zookeeper on that node. Any other ideas?
> >>
> >> I appreciate any help.
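
For anyone who wants to repeat the telnet check from the original message across all three servers at once, here is a minimal sketch rather than a vetted tool. It assumes the servers answer the `stat` four-letter command on client port 2181 and that the reply contains `Mode:` and `Zxid:` lines, as in the 3.4.x output. The function names and the host names in the usage note are hypothetical.

```python
import socket


def four_letter_word(host, port=2181, cmd="stat", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            # The server closes the connection once the reply is complete.
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode_and_zxid(reply):
    """Pull the 'Mode' and 'Zxid' fields out of a stat/srvr reply.

    Returns (mode, zxid); either value is None if its line is missing.
    """
    mode = None
    zxid = None
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        value = value.strip()
        if key == "Mode":
            mode = value
        elif key == "Zxid":
            # Zxid is printed in hex, e.g. 0x100000002.
            zxid = int(value, 16)
    return mode, zxid


def find_leaders(reports):
    """Given {host: (mode, zxid)}, list the hosts that claim to be leader."""
    return [host for host, (mode, _zxid) in reports.items() if mode == "leader"]


def survey(hosts, port=2181):
    """Query every host and return {host: (mode, zxid)}."""
    return {h: parse_mode_and_zxid(four_letter_word(h, port)) for h in hosts}
```

Calling `survey(["zk1", "zk2", "zk3"])` (placeholder host names) and passing the result to `find_leaders` should return more than one host if the ensemble is in the state described above. Comparing the Zxid values in the same result shows which node has diverged. This only reports what each server claims about itself; it is not a substitute for the logs.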