I think that the guarantee we still get in your scenario is that when the "boosted read" completes, the client knows that it has seen every write that completed before the boosted read (or the "strong sync" preceding it, if we use that idea) was invoked. Of course, as you say, it doesn't guarantee that it's still the leader, but if the read returns and says "you're still the leader", then you can process every event you received before you invoked the read (if that's what your application is doing) without worrying that someone else can also process these events. If I understand correctly, this is what John is saying in his first email.
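To make that pattern concrete, here is a toy sketch in plain Python (not ZooKeeper code; the boosted read is stubbed as a callback, and all names here are hypothetical): buffer the incoming events, take a snapshot, issue the quorum-acknowledged leadership check, and process the snapshot only if the check succeeds.

```python
# Toy sketch of the pattern above (hypothetical names, not a real API):
# snapshot the events received *before* the check was issued, confirm
# leadership with a quorum-acknowledged "boosted read", then process
# only that snapshot. If the check fails, process nothing.

def process_safely(buffered_events, boosted_read_still_leader):
    """Process events only if a quorum confirmed leadership after
    they were received."""
    snapshot = list(buffered_events)       # events seen before the check
    if boosted_read_still_leader():        # stand-in for the quorum-acked read
        return [e.upper() for e in snapshot]   # "process" the snapshot
    return []                              # leadership lost: do nothing

# A check that succeeds lets the buffered events through; a failing
# check discards them, since someone else may now be processing them.
print(process_safely(["a", "b"], lambda: True))   # ['A', 'B']
print(process_safely(["a", "b"], lambda: False))  # []
```

The point of the snapshot is that events arriving *after* the check was issued get no guarantee and must wait for the next check.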
It's true that even if we run an agreement without any data it will slow down the application somewhat, but it is the application's choice whether to use this operation. Currently, an application that wants strong semantics already queues its sync behind a list of writes on the leader, and I'm not sure that actually sending this sync out to followers would add much to the latency of the sync. It will only add a leader-to-followers round-trip latency if the pipeline was already empty.

Alex

On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <[email protected]> wrote:
> Say that we implement what you're suggesting. Could you check if this
> scenario can happen:
>
> 1- Client C1 is the current leader, and it issues a super boosted read to
> make sure it is still the leader;
> 2- We process the super boosted read, having it go through the zab pipeline;
> 3- When we send the response to C1 we slow down the whole deal: the response
> to C1 gets delayed and we stall C1;
> 4- In the meanwhile, C1's session expires on the server side and its
> ephemeral leadership node is removed;
> 5- A new client C2 is elected and starts exercising leadership;
> 6- Now C1 comes back to normal and receives the response of the super boosted
> read saying that it is still the leader.
>
> If my interpretation is correct, the only way to prevent this scenario
> from happening is if the session expires on the client side before it
> receives the response of the read. It doesn't look like we can do it if
> process clocks can be arbitrarily delayed.
>
> Note that one issue is that the behavior of ephemerals is highly dependent
> upon timers, so I don't think we can avoid making some timing assumptions
> altogether. The question is whether we are better off with a mechanism relying
> upon acknowledgements. My sense is that application-level fencing is
> preferable (if not necessary) for applications like the ones JC is mentioning
> or BookKeeper.
>
> I'm not concerned about writes to disk, which I agree we don't need for sync.
> I'm more concerned about having it go through the whole pipeline, which
> will induce more traffic to zab and increase latency for an application that
> uses it heavily.
>
> -Flavio
>
> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
>
>> Another idea is to add this functionality to MultiOp - have read-only
>> transactions be replicated but not logged, or logged asynchronously.
>> I'm not sure how it works right now if I do a read-only MultiOp
>> transaction - does it replicate the transaction or answer it locally
>> on the leader?
>>
>> Alex
>>
>> On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <[email protected]> wrote:
>>> Thanks for the explanation.
>>>
>>> I guess one could always invoke a write operation instead of sync to
>>> get the stricter semantics, but as John suggests, it might be a
>>> good idea to add a new type of operation that requires followers to
>>> ack but doesn't require them to log to disk - this seems sufficient in
>>> our case.
>>>
>>> Alex
>>>
>>> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <[email protected]>
>>> wrote:
>>>> In theory, the scenario you're describing could happen, but I would argue
>>>> that it is unlikely given that: 1) a leader pings followers twice a tick
>>>> to make sure that it has a quorum of supporters (lead()); 2) followers
>>>> give up on a leader upon catching an exception (followLeader()). One could
>>>> calibrate tickTime to make the probability of having this scenario low.
>>>>
>>>> Let me also revisit the motivation for the way we designed sync. ZooKeeper
>>>> has been designed to serve reads efficiently, and making sync go through
>>>> the pipeline would slow down reads. Although optional, we thought it would
>>>> be a good idea to make it as efficient as possible to comply with the
>>>> original expectations for the service. We consequently came up with this
>>>> cheap way of making sure that a read sees all pending updates.
>>>> It is correct that there are some corner cases that it doesn't cover. One
>>>> is the case you mentioned. Another is having the sync finish before the
>>>> client submits the read and having a write commit in between. We rely upon
>>>> the way we implement timeouts and some minimum degree of synchrony for the
>>>> clients when submitting operations to guarantee that the scheme works.
>>>>
>>>> We thought about the option of having the sync operation go through the
>>>> pipeline, and in fact it would have been easier to implement it just as a
>>>> regular write, but we opted not to because we felt it was sufficient for
>>>> the use cases we had and more efficient, as I already argued.
>>>>
>>>> Hope it helps to clarify.
>>>>
>>>> -Flavio
>>>>
>>>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
>>>>
>>>>> Thanks for the explanation! But how do you avoid the scenario
>>>>> raised by John?
>>>>> Let's say you're a client connected to F, and F is connected to L. Let's
>>>>> also say that L's pipeline is now empty, and both F and L are partitioned
>>>>> from 3 other servers in the system that have already elected a new leader
>>>>> L'. Now I go to L' and write something. L still thinks it's the leader
>>>>> because the detection that followers left it is obviously
>>>>> timeout-dependent. So when F sends your sync to L and L returns
>>>>> it to F, you actually miss my write!
>>>>>
>>>>> Alex
>>>>>
>>>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <[email protected]>
>>>>> wrote:
>>>>>> Hi Alex,
>>>>>>
>>>>>> Because of the following:
>>>>>>
>>>>>> 1- A follower F processes operations from a client in FIFO order, and
>>>>>> say that a client submits, as you say, sync + read;
>>>>>> 2- A sync will be processed by the leader and returned to the follower.
>>>>>> It will be queued after all pending updates that the follower hasn't
>>>>>> processed;
>>>>>> 3- The follower will process all pending updates before processing the
>>>>>> response of the sync;
>>>>>> 4- Once the follower processes the sync, it picks the read operation to
>>>>>> process. It reads the local state of the follower and returns to the
>>>>>> client.
>>>>>>
>>>>>> When we process the read in Step 4, we have applied all pending updates
>>>>>> the leader had for the follower by the time the read request started.
>>>>>>
>>>>>> This implementation is a bit of a hack because it doesn't follow the
>>>>>> same code path as the other operations that go to the leader, but it
>>>>>> avoids some unnecessary steps, which is important for fast reads. In the
>>>>>> sync case, the other followers don't really need to know about it (there
>>>>>> is nothing to be updated), and the leader simply inserts it in the
>>>>>> sequence of updates of F, ordering it.
>>>>>>
>>>>>> -Flavio
>>>>>>
>>>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>>>>>>
>>>>>>> Hi Flavio,
>>>>>>>
>>>>>>>> Starting a read operation concurrently with a sync implies that the
>>>>>>>> result of the read will not miss an update committed before the read
>>>>>>>> started.
>>>>>>>
>>>>>>> I thought that the intention of sync was to give something like
>>>>>>> linearizable reads, so if you invoke a sync and then a read, your read
>>>>>>> is guaranteed to (at least) see any write which completed before the
>>>>>>> sync began. Is this the intention? If so, how is this achieved
>>>>>>> without running agreement on the sync op?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Alex
>>>>>>>
>>>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <[email protected]>
>>>>>>> wrote:
>>>>>>>> sync simply flushes the channel between the leader and the follower
>>>>>>>> that forwarded the sync operation, so it doesn't go through the full
>>>>>>>> zab pipeline.
>>>>>>>> Flushing means that all pending updates from the leader
>>>>>>>> to the follower are received by the time sync completes. Starting a
>>>>>>>> read operation concurrently with a sync implies that the result of the
>>>>>>>> read will not miss an update committed before the read started.
>>>>>>>>
>>>>>>>> -Flavio
>>>>>>>>
>>>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>>>>>>>
>>>>>>>>> It's strange that sync doesn't run through agreement; I was always
>>>>>>>>> assuming that it does... exactly for the reason you say:
>>>>>>>>> you may trust your leader, but I may have a different leader, and
>>>>>>>>> your leader may not have detected it yet and still think it's the
>>>>>>>>> leader.
>>>>>>>>>
>>>>>>>>> This seems like a bug to me.
>>>>>>>>>
>>>>>>>>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or shouldn't)
>>>>>>>>> depend on timing assumptions.
>>>>>>>>> Only progress guarantees depend on time.
>>>>>>>>>
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>> I have some pretty strong requirements in terms of consistency,
>>>>>>>>>> where reading from followers that may be behind in terms of updates
>>>>>>>>>> isn't ok for my use case.
>>>>>>>>>>
>>>>>>>>>> One error case that worries me is if a follower and leader are
>>>>>>>>>> partitioned off from the network. A new leader is elected, but the
>>>>>>>>>> follower and old leader don't know about it.
>>>>>>>>>>
>>>>>>>>>> Normally I think sync was made for this purpose, but I looked at the
>>>>>>>>>> sync code, and if there aren't any outstanding proposals the leader
>>>>>>>>>> sends the sync right back to the client without first verifying that
>>>>>>>>>> it still has quorum, so this won't work for my use case.
>>>>>>>>>>
>>>>>>>>>> At the core of the issue, all I really need is a call that will make
>>>>>>>>>> its way to the leader and will ping its followers, ensure it still
>>>>>>>>>> has a quorum, and return success.
>>>>>>>>>>
>>>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be forwarded to
>>>>>>>>>> the leader; the leader will ensure it still has quorum and return
>>>>>>>>>> its epoch. I can use this primitive to implement all the other
>>>>>>>>>> properties I want to verify (assuming that my client will never
>>>>>>>>>> connect to an older epoch after this call returns). Also, the nice
>>>>>>>>>> thing about this method is that it will not have to hit disk, and
>>>>>>>>>> the latency should just be a round trip to the followers.
>>>>>>>>>>
>>>>>>>>>> Most of the guarantees offered by ZooKeeper are time-based and rely
>>>>>>>>>> on clocks and expiring timers, but I'm hoping to offer some
>>>>>>>>>> guarantees in spite of busted clocks, horrible GC perf, VM suspends,
>>>>>>>>>> and any other way time is broken.
>>>>>>>>>>
>>>>>>>>>> Also, if people are interested, I can go into more detail about what
>>>>>>>>>> I am trying to write.
>>>>>>>>>>
>>>>>>>>>> -jc
>>>>>>>>
>>>>>>
>>>>
>
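The quorum check John proposes above can be modeled with a short toy sketch (plain Python; the function name mirrors his getCurrentLeaderEpoch() suggestion, but the quorum-counting logic here is an illustrative assumption, not ZooKeeper code): the leader pings its followers, counts itself as one supporter, and returns its epoch only if a majority of the whole ensemble responds.

```python
# Toy model of the proposed getCurrentLeaderEpoch(): the leader pings
# followers (a round trip, no disk write) and returns its epoch only if
# a majority of the ensemble, leader included, acknowledges.
# Hypothetical sketch, not actual ZooKeeper code.

def get_current_leader_epoch(epoch, ensemble_size, ping_follower, followers):
    """Return the leader's epoch if it still has quorum, else None."""
    acks = 1  # the leader counts as one supporter of itself
    for f in followers:
        if ping_follower(f):  # stand-in for a leader-to-follower round trip
            acks += 1
    quorum = ensemble_size // 2 + 1
    return epoch if acks >= quorum else None

# A 5-server ensemble needs 3 supporters. With 2 of 4 followers reachable
# the leader keeps quorum; a leader partitioned with only 1 follower
# (John's error case) fails the check instead of answering from memory.
reachable = {"f1", "f2"}
print(get_current_leader_epoch(7, 5, lambda f: f in reachable,
                               ["f1", "f2", "f3", "f4"]))  # 7
print(get_current_leader_epoch(7, 5, lambda f: f == "f1",
                               ["f1", "f2", "f3", "f4"]))  # None
```

This is exactly the property the current sync short-circuit skips: when the pipeline is empty, no follower round trip happens, so a partitioned old leader can still answer.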
