While debugging a ZooKeeper cluster failure (separate thread), we came across something that may have happened and could have caused one of our servers to fall quite a bit behind the other two. We are running a cluster of three ZooKeeper servers in our development environment on Windows. They are not running as services; they are simply started from the command prompt. Because of this, it's possible that someone froze one server's console output by clicking in or marking text in its window. We saw this happen accidentally while debugging, and the end result is that all requests to that server back up until they either time out or the command prompt is unfrozen.
This got us wondering: what would happen if the elected leader were "frozen" in this manner? There is no guarantee about where in the code it would hang, so we can't know for certain what it would do on leaving this state. Could the "frozen" server come out of it still believing it is the leader (since it was stuck) when in fact another server has been elected in the meantime? I would imagine this resolves itself fairly quickly, but is there still a possibility that it could lead to bad behavior?

Typically, when a server fails, I would expect the ZooKeeper instance to die or lose leadership because of an event (failed connection, etc.), but this case seems slightly different because the code would be blocked at an arbitrary point.

This appears to be mostly a Windows issue, given how its command prompt behaves compared with other operating systems. We are going to avoid it either by installing a service that is responsible for starting the ZooKeeper servers or by piping the output to a file that we can tail.

Thanks,
-Scott
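
P.S. For anyone who wants to try the "pipe the output to a file" workaround, here is a rough sketch of what we have in mind. It is only an illustration, not something we have run against our cluster: the helper name `start_with_log` and the log file name are made up, and it assumes the freeze comes from the server blocking on writes to the console, so that sending stdout and stderr to a file sidesteps it.

```python
import subprocess


def start_with_log(command, log_path):
    """Start `command` with stdout and stderr appended to `log_path`.

    Hypothetical helper: the child writes to a file rather than to the
    console window, so marking text in the console should not stall it.
    """
    log_file = open(log_path, "ab")
    try:
        return subprocess.Popen(
            command,
            stdin=subprocess.DEVNULL,   # no console input needed
            stdout=log_file,            # output goes to the log file
            stderr=subprocess.STDOUT,   # merge stderr into the same file
        )
    finally:
        # The child has inherited its own handle; close the parent's copy.
        log_file.close()
```

A call might look like `start_with_log(["cmd", "/c", "zkServer.cmd"], "zookeeper-1.log")`, where the script location and log name depend on your install, and the log can then be followed with any tail-style tool.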