Hello,

Ensuring reliability requires either using consensus directly in your service or changing the service to use a distributed log/journal (e.g. BookKeeper).

However, the following idea is simple and in many situations good enough: if you configure the session timeout to 15 seconds, then a partitioned ZooKeeper client will be disconnected after at most 15 seconds, so the old service leader will detect the network partition at most 15 seconds after it happened. The new service leader should stay idle for more than that initial 15 seconds (let's say 30 seconds). This way you avoid the situation with 2 concurrently working leaders.
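For illustration, a sketch of that idea using Apache Curator's LeaderSelector could look roughly like the code below. The connect string, the znode path and the doLeaderWork() hook are placeholders, and the timeouts are just the example values from above.

// Sketch only: timeouts mirror the example above; the connect string,
// znode path and doLeaderWork() are hypothetical placeholders.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class DelayedLeaderExample {
    static final int SESSION_TIMEOUT_MS = 15_000; // old leader notices the partition within ~15s
    static final long INITIAL_IDLE_MS = 30_000;   // > session timeout, so the old leader has stepped down

    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",     // placeholder connect string
                SESSION_TIMEOUT_MS, 5_000, new ExponentialBackoffRetry(1_000, 3));
        client.start();

        LeaderSelector selector = new LeaderSelector(client, "/my-service/leader",
                new LeaderSelectorListenerAdapter() {
                    // The adapter cancels leadership (interrupting this thread)
                    // when the connection is SUSPENDED or LOST.
                    @Override
                    public void takeLeadership(CuratorFramework cf) throws Exception {
                        // Stay idle longer than the session timeout so that a leader
                        // stranded on the other side of a partition has stepped down.
                        Thread.sleep(INITIAL_IDLE_MS);
                        doLeaderWork();               // placeholder for the real work loop
                    }
                });
        selector.autoRequeue();                       // rejoin the election after losing leadership
        selector.start();

        Thread.currentThread().join();                // keep the demo process alive
    }

    static void doLeaderWork() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            Thread.sleep(1_000);                      // placeholder work
        }
    }
}

The only essential property is that INITIAL_IDLE_MS stays comfortably larger than the session timeout, i.e. larger than the time the old leader needs to notice the disconnection.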
The described solution has time dependencies, though, and in some situations it leads to an incorrect state. For example, high load on a machine might cause the ZooKeeper client to detect the disconnection only after 60 seconds (instead of the expected 15 seconds). In such a situation there will be 2 concurrent leaders.

Maciej

On Thu, Dec 6, 2018 at 8:09 PM Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:

> > it seems like the inconsistency may be caused by the partition of the
> > ZooKeeper cluster itself
>
> Yes - there are many ways in which you can end up with 2 leaders. However,
> if properly tuned and configured, it will be for a few seconds at most.
> During a GC pause no work is being done anyway. But, this stuff is very
> tricky. Requiring an atomically unique leader is actually a design smell
> and you should reconsider your architecture.
>
> > Maybe we can use a more lightweight Hazelcast for example?
>
> There is no distributed system that can guarantee a single leader. Instead
> you need to adjust your design and algorithms to deal with this (using
> optimistic locking, etc.).
>
> -Jordan
>
> > On Dec 6, 2018, at 1:52 PM, Michael Borokhovich <michael...@gmail.com> wrote:
> >
> > Thanks Jordan,
> >
> > Yes, I will try Curator.
> > Also, beyond the problem described in the Tech Note, it seems like the
> > inconsistency may be caused by the partition of the ZooKeeper cluster
> > itself. E.g., if a "leader" client is connected to the partitioned ZK node,
> > it may not be notified at the same time as the other clients connected to
> > the other ZK nodes. So, another client may take leadership while the
> > current leader is still unaware of the change. Is that true?
> >
> > Another follow-up question: if ZooKeeper can guarantee a single leader, is
> > it worth using it just for leader election? Maybe we can use something
> > more lightweight, Hazelcast for example?
> >
> > Michael.
> >
> > On Thu, Dec 6, 2018 at 4:50 AM Jordan Zimmerman <jor...@jordanzimmerman.com> wrote:
> >
> >> It is not possible to achieve the level of consistency you're after in an
> >> eventually consistent system such as ZooKeeper. There will always be an
> >> edge case where two ZooKeeper clients will believe they are leaders
> >> (though for a short period of time). In terms of how it affects Apache
> >> Curator, we have this Tech Note on the subject:
> >> https://cwiki.apache.org/confluence/display/CURATOR/TN10
> >> (the description is true for any ZooKeeper client, not just Curator
> >> clients). If you do still intend to use a ZooKeeper lock/leader, I suggest
> >> you try Apache Curator, as writing these "recipes" is not trivial and they
> >> have many gotchas that aren't obvious.
> >>
> >> -Jordan
> >>
> >> http://curator.apache.org
> >>
> >>> On Dec 5, 2018, at 6:20 PM, Michael Borokhovich <michael...@gmail.com> wrote:
> >>>
> >>> Hello,
> >>>
> >>> We have a service that runs on 3 hosts for high availability. However, at
> >>> any given time, exactly one instance must be active.
> >>> So, we are thinking of using leader election with ZooKeeper.
> >>> To this end, on each service host we also start a ZK server, so we have a
> >>> 3-node ZK cluster and each service instance is a client of its dedicated
> >>> ZK server.
> >>> Then, we implement leader election on top of ZooKeeper using a basic
> >>> recipe:
> >>> https://zookeeper.apache.org/doc/r3.1.2/recipes.html#sc_leaderElection
> >>>
> >>> I have the following questions/doubts regarding the approach:
> >>>
> >>> 1. It seems like we can run into inconsistency issues when a network
> >>> partition occurs. The ZooKeeper documentation says that the inconsistency
> >>> period may last "tens of seconds". Am I understanding correctly that
> >>> during this time we may have 0 or 2 leaders?
> >>> 2. Is it possible to reduce this inconsistency time (let's say to 3
> >>> seconds) by tweaking the tickTime and syncLimit parameters?
> >>> 3. Is there a way to guarantee exactly one leader all the time? Should we
> >>> implement a more complex leader election algorithm than the one suggested
> >>> in the recipe (using ephemeral_sequential nodes)?
> >>>
> >>> Thanks,
> >>> Michael.
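P.S. To make the "optimistic locking" point above a bit more concrete: instead of assuming there is exactly one leader, each leader can attach a monotonically increasing fencing token to its writes (for example an epoch counter, or the czxid of the znode it created during the election), and the shared resource rejects any write whose token is older than one it has already accepted. A rough, hypothetical sketch:

// Illustration only: every write carries a monotonically increasing fencing
// token and the shared resource rejects anything older than what it has
// already seen. All names here are hypothetical.
public class FencedResource {
    private long highestTokenSeen = Long.MIN_VALUE;

    // Returns false (and does nothing) if the caller's token is stale,
    // i.e. a newer leader has already written.
    public synchronized boolean write(long fencingToken, String key, String value) {
        if (fencingToken < highestTokenSeen) {
            return false;                 // stale leader: ignore the write
        }
        highestTokenSeen = fencingToken;
        applyWrite(key, value);           // placeholder for the real mutation
        return true;
    }

    private void applyWrite(String key, String value) {
        // placeholder
    }
}

With such a check in place, a second leader that has not yet noticed the partition can still issue writes, but they are simply rejected, so the short window with two leaders does not corrupt the shared state.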