While debugging our ZooKeeper cluster failure (separate thread), we ran
across something that could have caused one of our servers to fall quite a
bit behind the other two.  We are running a cluster of 3 ZooKeeper servers
in our development environment on Windows.  These are not running as
services and are just started from the command prompt.  Because of this,
it's possible that one of the servers had its console output frozen by
someone clicking in or marking text in its window.  We saw this happen
accidentally while debugging, and the end result is that all requests to
that server back up until they either time out or the command prompt is
unfrozen.

This got us wondering: what would happen if the elected leader were
"frozen" in this manner?  There's no guarantee where in the code it would
be hung, so we can't know for certain what would happen when it left this
state.  Could the "frozen" server come out of this state still thinking it
was the leader (since it was stuck) when in fact another server had been
elected in the meantime?  I would imagine this should resolve itself fairly
quickly, but is there still a possibility that it could lead to bad
behavior?  Typically, if a server fails, I would imagine the ZooKeeper
instance would die or lose leadership because of an event (failed
connection, etc.), but this seems slightly different since the code would
be blocked at an arbitrary point.

This seems to be more of a Windows issue, given how its command prompts
work compared with other operating systems.  We're going to avoid it by
either installing a service that is responsible for starting the ZooKeeper
servers or redirecting the output to a file that we can tail.
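
For the second option, here is a rough sketch of what we have in mind,
assuming Python is available on the machine.  The helper name
start_with_log is made up for illustration, and this is not anything that
ships with ZooKeeper.  The idea is only that the server's output goes to a
file rather than to a console window, so a selection in the window should
no longer be able to block its writes.

```python
import subprocess
from pathlib import Path


def start_with_log(command, log_path):
    """Start `command` with stdout and stderr appended to `log_path`.

    Hypothetical helper: the child writes to a file instead of a console
    window, and the Popen object is returned so the caller can wait on it.
    """
    log_path = Path(log_path)
    # Create the log directory if it does not exist yet.
    log_path.parent.mkdir(parents=True, exist_ok=True)

    # Open in append mode so earlier output is kept across restarts.
    with open(log_path, "ab") as log_file:
        process = subprocess.Popen(
            command,
            stdin=subprocess.DEVNULL,   # never wait on console input
            stdout=log_file,            # standard output goes to the file
            stderr=subprocess.STDOUT,   # standard error goes to the same file
        )
    # The child keeps its own copy of the file handle, so the parent's
    # copy can be closed as soon as the process has been started.
    return process
```

We would call it as
start_with_log(["cmd", "/c", r"C:\zookeeper\bin\zkServer.cmd"], r"C:\zookeeper\logs\zk1.log"),
where both paths are placeholders for our setup, and then follow the file
with something like PowerShell's Get-Content -Wait.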

Thanks,
-Scott
