Hi folks,

We have been playing around with ZooKeeper for a few weeks now, and while reading carefully through the documentation I noticed this statement:
"If you are using watches, you must look for the connected watch event. When a ZooKeeper client disconnects from a server, you will not receive notification of changes until reconnected. If you are watching for a znode to come into existence, you will miss the event if the znode is created and deleted while you are disconnected."

As noted in ZOOKEEPER-1209, this can cause really serious issues. As leader election is one of the most requested features / recipes, I would really like to see the official recipe fixed and fully functional.

I decided to take a look at other implementations of leader election and, surprisingly, none of them seemed to handle the Disconnected / Expired / SyncConnected events in a simple way. Here's my quick analysis of what they do, and I'd love to know whether I'm missing something or whether they are really wrong:

Twitter commons library: the Election recipe is based on their "Group" implementation, with EPHEMERAL|SEQUENTIAL nodes, in the same way as the official LES algorithm. Looking at the Group impl ( https://github.com/twitter/commons/blob/master/src/java/com/twitter/common/zookeeper/Group.java ), they handle the Expired event and retry to join / watch, which calls getClient(), which in turn recreates the connection if it has expired. This looks fine for the Expired event, but what about the Disconnected / SyncConnected events? Nothing.

Netflix's curator library: the leader acquires an inter-process mutex, also backed by a group of EPHEMERAL|SEQUENTIAL nodes. Netflix's library has a big advantage: it has a built-in API for retrying actions, so leader election will try to acquire the lock and retry if anything goes wrong in the middle. On any event, the loop waiting for the lock is notified and retries on failure, so a Disconnected or Expired event would be handled properly while waiting. On the other hand, it seems that once the leader has been elected, these events are simply ignored.
This may lead to the same split-brain issue as in the original LES example (see https://github.com/Netflix/curator/blob/master/curator-recipes/src/main/java/com/netflix/curator/framework/recipes/locks/LockInternals.java for details).

That's all I have come up with so far. If you guys have the time to take a look at these implementations, I would love to know if I missed something.

I think this split-brain issue may almost never happen, but as usual, what should never happen hits you hard when you don't expect it. A robust leader election implementation would be really great to have.

Cheers,
Jérémie
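
P.S. To make the concern concrete, here is a minimal Java sketch of the kind of event handling I have in mind. It deliberately does not use the ZooKeeper client API: LeadershipGuard, ConnState and Action are hypothetical names that only mirror the Disconnected / SyncConnected / Expired events discussed above. The policy itself (stop leading on Disconnected, re-check the group on SyncConnected, rejoin on Expired) is an assumption on my part and a conservative one, not what any of the libraries above actually does.

```java
// Sketch only: LeadershipGuard, ConnState and Action are hypothetical names,
// not part of ZooKeeper, Twitter commons or Curator.
final class LeadershipGuard {

    /** Mirrors the connection events named in this message. */
    enum ConnState { SYNC_CONNECTED, DISCONNECTED, EXPIRED }

    /** What the election recipe should do next (assumed policy). */
    enum Action { NONE, STOP_LEADING, RECHECK_MEMBERSHIP, REJOIN_ELECTION }

    private boolean elected = false;
    private boolean connected = true;

    /** Called by the recipe once it has confirmed that our node leads the group. */
    synchronized void onElected() {
        elected = true;
    }

    /** We only act as leader while elected AND connected. */
    synchronized boolean isLeader() {
        return elected && connected;
    }

    /** Feed every connection event here, including after the election is won. */
    synchronized Action onConnectionEvent(ConnState state) {
        switch (state) {
            case DISCONNECTED:
                // No notifications arrive while disconnected, so step back.
                connected = false;
                return elected ? Action.STOP_LEADING : Action.NONE;
            case SYNC_CONNECTED:
                // Events may have been missed; leadership must be re-confirmed
                // by the recipe, which calls onElected() again if we still lead.
                connected = true;
                elected = false;
                return Action.RECHECK_MEMBERSHIP;
            case EXPIRED:
                // The session is gone, and our ephemeral node with it.
                connected = false;
                elected = false;
                return Action.REJOIN_ELECTION;
            default:
                throw new IllegalArgumentException("unknown state: " + state);
        }
    }

    public static void main(String[] args) {
        LeadershipGuard guard = new LeadershipGuard();
        guard.onElected();
        for (ConnState s : ConnState.values()) {
            Action next = guard.onConnectionEvent(s);
            System.out.println(s + " -> " + next + ", leader=" + guard.isLeader());
        }
    }
}
```

In a real client, the connection watcher would call onConnectionEvent for every state change, and the leader's work loop would check isLeader() before each action. The point is only that these events still matter after the election has been won.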
