Just a quick post to point out that the leader election example that was posted on the list earlier today is very clean and handles the disconnected / expired cases.
https://github.com/cyberroadie/zookeeper-leader/

Jérémie

On Fri, Nov 18, 2011 at 7:04 PM, Jordan Zimmerman <[email protected]> wrote:
> I just did a quickie test. If the cluster goes down you get the Disconnect
> but do not get a session expiration. So, there wouldn't be an opportunity
> to transition from SUSPENDED to LOST (unless the client makes another ZK
> call). So, this brings me back to doing the background sync().
>
> -JZ
>
> On 11/18/11 9:52 AM, "Ted Dunning" <[email protected]> wrote:
>
>>Is the background sync even necessary? The ZK client itself will
>>re-establish connection if it can.
>>
>>I think that LOST should only be sent on session expiration.
>>
>>On Fri, Nov 18, 2011 at 1:07 AM, Jordan Zimmerman
>><[email protected]> wrote:
>>
>>> FYI
>>>
>>> Curator now has a staged connection notification mechanism for dealing
>>> with issues like this. When the Curator managed connection receives a
>>> Disconnect, it posts a message to listeners that the connection is
>>> SUSPENDED. If the connection can be re-established (via a background
>>> sync() using the current retry policy) the listeners receive RECONNECTED
>>> otherwise they receive LOST. Thus, users of the Curator LeaderSelector
>>> can know if they should pause their leader activity and/or stop leader
>>> activity.
>>>
>>> -JZ
>>> ________________________________________
>>> From: Ted Dunning [[email protected]]
>>> Sent: Monday, November 14, 2011 6:24 PM
>>> To: [email protected]
>>> Subject: Re: Missing session state handling in most Leader Election
>>> implementations
>>>
>>> On Mon, Nov 14, 2011 at 2:41 PM, Jordan Zimmerman
>>> <[email protected]> wrote:
>>>
>>> > It turns out that this is tricky to solve. When the server you're
>>> > connected to goes down, you get a Watcher.Event.KeeperState.Disconnected.
>>> > However, it could be that you are able to reconnect to another server
>>> > so the disconnected event should be ignored.
>>>
>>>
>>> The event should not be ignored. The master should pause in being a
>>> master, but not unload any major data structures. If it reconnects
>>> instantly, then it should continue as if nothing had happened. You can
>>> also have a time limit for how long you wait before you decide to pause
>>> operation as master. As you increase that time, you increase the
>>> probability of two masters existing at the same time. If the reconnect
>>> happens before the timeout, you don't need to bother the master.
>>>
>
>

-- 
Jérémie 'ahFeel' BORDIER
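
P.S. For anyone skimming the thread, here is a rough sketch of the staged notification idea quoted above (SUSPENDED on Disconnect, then either RECONNECTED or LOST), written as a small standalone Python state tracker. It is not Curator's code and does not talk to ZooKeeper: `LeaderGuard`, `grace_seconds`, `tick()` and the injected `now` timestamps are all made-up names for illustration, and the fixed grace period is only a stand-in for the retry policy mentioned in the thread.

```python
from enum import Enum


class ConnState(Enum):
    CONNECTED = "CONNECTED"
    SUSPENDED = "SUSPENDED"
    RECONNECTED = "RECONNECTED"
    LOST = "LOST"


class LeaderGuard:
    """Hypothetical helper that decides whether a leader may keep acting."""

    def __init__(self, grace_seconds):
        # Assumption: a fixed grace period stands in for a real retry policy.
        self.grace_seconds = grace_seconds
        self.state = ConnState.CONNECTED
        self.is_leader = True  # assume the election was already won
        self.paused = False
        self._suspended_at = None

    def on_disconnected(self, now):
        # Disconnect: pause leader work, keep data structures loaded.
        if self.state not in (ConnState.SUSPENDED, ConnState.LOST):
            self.state = ConnState.SUSPENDED
            self.paused = True
            self._suspended_at = now

    def on_reconnected(self, now):
        # Check the grace period first, so a late reconnect cannot revive us.
        self.tick(now)
        if self.state is ConnState.SUSPENDED:
            self.state = ConnState.RECONNECTED
            self.paused = False
            self._suspended_at = None

    def on_session_expired(self):
        # Session expiration always means leadership is gone.
        self._lose()

    def tick(self, now):
        # Called periodically: give up once the grace period has run out.
        if (self.state is ConnState.SUSPENDED
                and now - self._suspended_at >= self.grace_seconds):
            self._lose()

    def _lose(self):
        self.state = ConnState.LOST
        self.is_leader = False
        self.paused = True
        self._suspended_at = None

    def may_act_as_leader(self):
        return self.is_leader and not self.paused


guard = LeaderGuard(grace_seconds=10)
guard.on_disconnected(now=0)   # pause leader activity
guard.on_reconnected(now=3)    # back within the grace period, resume
```

Passing `now` in explicitly keeps the sketch deterministic. The trade-off is the one described in the thread: the longer the leader waits before standing down, the higher the chance that two masters exist at the same time.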
