Makes sense. Thanks, Ted. We will design our system to cope with the short periods where we might have two leaders.
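
Concretely, we plan to fence every externally visible write behind a versioned znode, so a write from a stale leader is rejected rather than applied. A minimal sketch with the plain ZooKeeper Java client (the /app/epoch path and the FencedWriter helper are made up for illustration):

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch: each leader records the znode version it observed on a
    // shared epoch znode when it won the election, and makes every
    // subsequent write conditional on that version. If a newer leader
    // has since bumped the epoch, the conditional setData fails with
    // BadVersion instead of silently overwriting the new leader's state.
    public class FencedWriter {
        private final ZooKeeper zk;
        private final int epochVersion; // version seen when we became leader

        public FencedWriter(ZooKeeper zk, int epochVersion) {
            this.zk = zk;
            this.epochVersion = epochVersion;
        }

        public void fencedWrite(byte[] payload) throws Exception {
            try {
                // Conditional write: succeeds only while our epoch is current.
                zk.setData("/app/epoch", payload, epochVersion);
            } catch (KeeperException.BadVersionException e) {
                // Another leader has taken over; stop acting as leader.
                throw new IllegalStateException("lost leadership", e);
            }
        }
    }

This doesn't prevent two leaders from existing briefly; it just makes the stale one harmless for anything that goes through ZK.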
On Thu, Dec 6, 2018 at 11:03 PM Ted Dunning <ted.dunn...@gmail.com> wrote:

> ZK is able to guarantee that there is only one leader for the purposes
> of updating ZK data. That is because all commits have to originate with
> the current quorum leader and then be acknowledged by a quorum of the
> current cluster. If the leader can't get enough acks, then it has de
> facto lost leadership.
>
> The problem comes when there is another system that depends on ZK data.
> Such data might record which node is the leader for some other purpose.
> That node will only assume it has become leader if it succeeds in
> writing to ZK, but if there is a partition, the old leader may not be
> notified that another leader has established itself until some time
> after it has happened. Of course, if the erstwhile leader tried to
> validate its position with a write to ZK, that write would fail, but
> you can't spend 100% of your time doing that.
>
> It all comes down to the fact that a ZK client determines that it is
> connected to a ZK cluster member by pinging, and that cluster member
> sees heartbeats from the leader. The fact is, though, that you can't
> tune these pings to be faster than some level, because you start to see
> lots of false positives for loss of connection. Remember that it isn't
> just loss of connection that is the point here; any kind of delay would
> have the same effect. Getting these ping intervals below one second
> makes for a very twitchy system.
>
> On Fri, Dec 7, 2018 at 11:03 AM Michael Borokhovich
> <michael...@gmail.com> wrote:
>
> > We are planning to run the Zookeeper nodes embedded with the client
> > nodes, i.e., each client also runs a ZK node. So a network partition
> > will disconnect a ZK node, not only the client.
> > My concern is about the following statement from the ZK documentation:
> >
> > "Timeliness: The clients view of the system is guaranteed to be
> > up-to-date within a certain time bound. (*On the order of tens of
> > seconds.*) Either system changes will be seen by a client within this
> > bound, or the client will detect a service outage."
> >
> > What are these "*tens of seconds*"? Can we reduce this time by
> > configuring "syncLimit" and "tickTime" to, let's say, 5 seconds? Can
> > we have a strong guarantee on this time bound?
> >
> > On Thu, Dec 6, 2018 at 1:05 PM Jordan Zimmerman
> > <jor...@jordanzimmerman.com> wrote:
> >
> > > > Old service leader will detect network partition max 15 seconds
> > > > after it happened.
> > >
> > > If the old service leader is in a very long GC, it will not detect
> > > the partition. In the face of VM pauses, etc., it's not possible to
> > > avoid 2 leaders for a short period of time.
> > >
> > > -JZ
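
To put rough numbers on Ted's timing point: session timeouts are negotiated between 2*tickTime and 20*tickTime, so with the default tickTime=2000 the ceiling is 40 seconds, which is where "tens of seconds" comes from. A zoo.cfg sketch with illustrative values only (not a recommendation):

    # Sessions are negotiated between 2*tickTime and 20*tickTime,
    # so tickTime=500 bounds every session to the 1s..10s range.
    tickTime=500
    # Ticks a follower may take for its initial sync with the leader (5s here).
    initLimit=10
    # Ticks a follower may lag behind the leader before being dropped (2s here).
    syncLimit=4

So the bound is tunable, but as Ted says, pushing it much below a second trades staleness for false positives, and none of it covers a stalled JVM, hence the fencing above.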