Wow, thank you for that awesome explanation Ted. Makes me feel really happy that I decided to build redis_failover on top of ZooKeeper! :)
https://github.com/ryanlecompte/redis_failover

Best,
Ryan

On Thu, Apr 19, 2012 at 11:53 AM, Ted Dunning <[email protected]> wrote:

> The client can't think it has succeeded with a deletion if it is connected
> to the minority side of a partitioned cluster. To think that, the commit
> would have to be ack'ed by a majority, which by definition can't happen:
> either the master is in the minority and can't get a majority, or the
> master is no longer reachable from the server the client is connected to.
> If the master is in the minority, then when the commit fails, the minority
> will start a leader election, which will fail due to inability to commit.
> At some point, the majority side will tire of waiting to hear from the
> master and will also start an election, which will succeed.
>
> All clients connected to the minority side will be told to reconnect and
> will either fail, if they can't talk to a node on the master side, or will
> succeed in connecting to a node in the new quorum.
>
> When and if the partition heals, the nodes in the minority will
> resynchronize and then start handling connections and requests. Pending
> requests from them will be discarded because the epoch number will have
> been incremented in the new leader election.
>
> On Thu, Apr 19, 2012 at 11:42 AM, Ryan LeCompte <[email protected]> wrote:
>
> > Great questions. I'd also like to add:
> >
> > - What happens when there is a network partition, and one client
> > successfully deletes a znode for which other clients have set up watches?
> > Are the clients guaranteed to receive that node-deleted watch event if
> > the client successfully thinks it deleted the znode from the other side?
> >
> > Thanks,
> > Ryan
> >
> > On Thu, Apr 19, 2012 at 11:29 AM, Martin Kou <[email protected]> wrote:
> >
> > > Hi folks,
> > >
> > > I've got a few questions about how ZooKeeper servers behave in
> > > fail-recover scenarios.
> > >
> > > Assume I have a 5-server ZooKeeper cluster, and one of the servers went
> > > dark and came back, say, 1 hour later.
> > >
> > > 1. Is it correct to assume that clients won't be able to connect to the
> > > recovering server while it's still synchronizing with the leader, and
> > > thus any new client connections would automatically fall back to the
> > > other 4 servers during synchronization?
> > >
> > > 2. The documentation says a newly recovered server has (initLimit *
> > > tickTime) seconds to synchronize with the leader when it's restarted.
> > > Is it correct to assume the time needed for synchronization is bounded
> > > by the amount of data managed by ZooKeeper? Say, in the worst case,
> > > someone set a very large snapCount on the cluster and there were a lot
> > > of transactions but not many znodes - so there isn't much data on each
> > > ZooKeeper server, but there is a very long transaction log. Would that
> > > bound still hold?
> > >
> > > 3. I noticed from the documentation that a ZooKeeper server falling more
> > > than (syncLimit * tickTime) seconds behind the leader will be dropped
> > > from the quorum. I guess that's for detecting network partitions, right?
> > > If the partitioned server does report back to the leader later, how
> > > would it behave? (e.g. would it deny new client connections while it's
> > > synchronizing?)
> > >
> > > Thanks.
> > >
> > > Best Regards,
> > > Martin Kou
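
For the watch scenario Ryan asks about, a minimal sketch against the standard ZooKeeper Java client may help (the connect string, session timeout, and the /example/lock path are placeholders for illustration). It registers an exists() watch and handles NodeDeleted alongside the Disconnected / SyncConnected / Expired session states a client can see across a partition: as long as the session itself survives, a reconnected client re-registers its watches and still receives a pending NodeDeleted event, while an expired session loses its watches entirely.

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class DeleteWatchExample {
    public static void main(String[] args) throws Exception {
        final CountDownLatch deleted = new CountDownLatch(1);

        // Connect string and session timeout are placeholder values.
        final ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getType() == Event.EventType.None) {
                    // Connection-state changes arrive on the same callback as data watches.
                    switch (event.getState()) {
                        case Disconnected:
                            // Partitioned away from our server: the watch is not lost,
                            // but no events are delivered until we reconnect.
                            System.out.println("Disconnected, waiting for reconnect");
                            break;
                        case SyncConnected:
                            // On reconnect within the same session the client re-registers
                            // its watches; a deletion that happened meanwhile fires now.
                            System.out.println("(Re)connected");
                            break;
                        case Expired:
                            // Session expired: watches are gone and the application must
                            // create a new ZooKeeper handle and set new watches.
                            System.out.println("Session expired");
                            break;
                        default:
                            break;
                    }
                } else if (event.getType() == Event.EventType.NodeDeleted) {
                    System.out.println("Znode deleted: " + event.getPath());
                    deleted.countDown();
                }
            }
        });

        // One-shot watch: fires when /example/lock (a hypothetical path) is deleted.
        Stat stat = zk.exists("/example/lock", true);
        if (stat == null) {
            System.out.println("Znode does not exist yet");
        }

        deleted.await();
        zk.close();
    }
}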

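For Martin's questions 2 and 3, the bounds in question come from the tickTime, initLimit, and syncLimit settings in zoo.cfg. An illustrative configuration for a 5-server ensemble (hostnames, paths, and values are placeholders, not recommendations):

# zoo.cfg -- all values below are illustrative.
# Basic time unit in milliseconds.
tickTime=2000
# Time, in ticks, allowed for a (re)joining follower to connect to and sync
# with the leader: initLimit * tickTime = 20 seconds here. Large snapshots
# or long transaction logs may call for a bigger value.
initLimit=10
# A follower that falls more than syncLimit * tickTime = 10 seconds behind
# the leader is dropped from the quorum until it resynchronizes.
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
server.4=zk4:2888:3888
server.5=zk5:2888:3888

With these example values, a recovering server gets 20 seconds to sync with the leader before the attempt is abandoned, and a server lagging more than 10 seconds is dropped until it catches up, which matches the recovery behavior discussed above.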