ZK is able to guarantee that there is only one leader for the purposes of
updating ZK data. That is because all commits have to originate with the
current quorum leader and then be acknowledged by a quorum of the current
cluster. IF the leader can't get enough acks, then it has de facto lost
leadership.
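
To make that concrete from a client's point of view, here is a rough
sketch (the connect string and znode path are just placeholders): if the
server you are connected to can no longer reach a quorum leader, your
write simply does not commit and you see a connection-loss error instead.

    import org.apache.zookeeper.*;

    public class QuorumWriteSketch {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10000, event -> {});
            try {
                // Commits only after the current quorum leader collects acks from a majority.
                zk.setData("/app/config", "v2".getBytes(), -1);
            } catch (KeeperException.ConnectionLossException e) {
                // The server we were talking to has no quorum behind it; the write did not happen.
            } finally {
                zk.close();
            }
        }
    }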

The problem comes when there is another system that depends on ZK data.
Such data might record which node is the leader for some other purpose.
That node will only assume it has become leader if it succeeds in writing
to ZK, but if there is a partition, the old leader may not be notified
that another leader has established itself until some time after it has
happened. Of course, if the erstwhile leader tried to validate its
position with a write to ZK, that write would fail, but you can't spend
100% of your time doing that.
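
Here is a rough sketch of that "validate with a write" idea (the
/service/leader path, the node-id payload, and the versioned setData
scheme are my own invention for illustration): the check only tells you
about a takeover at the moment it runs, so the interval between checks is
still a window where two nodes can both believe they are the leader.

    import org.apache.zookeeper.*;
    import org.apache.zookeeper.data.Stat;

    public class LeaderCheckSketch {
        private final ZooKeeper zk;
        private final String myId;
        private int lastSeenVersion;   // version of /service/leader when we won leadership

        LeaderCheckSketch(ZooKeeper zk, String myId, int versionWhenElected) {
            this.zk = zk;
            this.myId = myId;
            this.lastSeenVersion = versionWhenElected;
        }

        /** True only if this node still appears to be the recorded leader. */
        boolean stillLeader() throws InterruptedException {
            try {
                // Conditional write: succeeds only if nobody rewrote the leader znode since we did.
                Stat stat = zk.setData("/service/leader", myId.getBytes(), lastSeenVersion);
                lastSeenVersion = stat.getVersion();
                return true;
            } catch (KeeperException.BadVersionException e) {
                return false;   // someone else has claimed leadership
            } catch (KeeperException e) {
                return false;   // partition, expired session, etc. -- assume the worst
            }
        }
    }

Calling stillLeader() before every externally visible action narrows the
window but does not close it, and calling it constantly is exactly the
"100% of your time" problem.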

It all comes down to the fact that a ZK client determines that it is
connected to a ZK cluster member by pinging, and that cluster member in
turn sees heartbeats from the leader. The fact is, though, that you can't
tune these pings to be faster than some level before you start to see
lots of false positives for loss of connection. And remember, it isn't
just loss of connection that matters here; any kind of delay would have
the same effect. Getting these ping intervals below one second makes for
a very twitchy system.
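
For reference, the knobs involved live in zoo.cfg, and the arithmetic is
why the documented bound comes out at "tens of seconds". The numbers
below are only illustrative, not recommendations:

    tickTime=2000     # base heartbeat interval in ms
    initLimit=10      # followers get initLimit * tickTime (20s here) to connect and sync with the leader
    syncLimit=5       # a follower may lag the leader by at most syncLimit * tickTime (10s here)
    # Client session timeouts are negotiated into [2 * tickTime, 20 * tickTime]
    # (4s..40s here) unless minSessionTimeout/maxSessionTimeout override the defaults.
    # Pushing tickTime much below ~1000ms is the "twitchy" regime described above.

So you can shrink the bound by lowering tickTime and syncLimit, but the
false-positive problem above is what puts a practical floor under it.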



On Fri, Dec 7, 2018 at 11:03 AM Michael Borokhovich <michael...@gmail.com>
wrote:

> We are planning to run Zookeeper nodes embedded with the client nodes.
> I.e., each client also runs a ZK node. So, a network partition will
> disconnect a ZK node and not only the client.
> My concern is about the following statement from the ZK documentation:
>
> "Timeliness: The clients view of the system is guaranteed to be up-to-date
> within a certain time bound. (*On the order of tens of seconds.*) Either
> system changes will be seen by a client within this bound, or the client
> will detect a service outage."
>
> What are these "*tens of seconds*"? Can we reduce this time by configuring
> "syncLimit" and "tickTime" to let's say 5 seconds? Can we have a strong
> guarantee on this time bound?
>
>
> On Thu, Dec 6, 2018 at 1:05 PM Jordan Zimmerman <
> jor...@jordanzimmerman.com>
> wrote:
>
> > > Old service leader will detect network partition max 15 seconds after
> > > it happened.
> >
> > If the old service leader is in a very long GC it will not detect the
> > partition. In the face of VM pauses, etc. it's not possible to avoid 2
> > leaders for a short period of time.
> >
> > -JZ
>
