Another idea is to add this functionality to MultiOp - have read-only
transactions be replicated but not logged, or logged asynchronously. I'm not
sure how it works right now if I do a read-only MultiOp transaction - does it
replicate the transaction or answer it locally on the leader?
Alex

On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <[email protected]> wrote:
> Thanks for the explanation.
>
> I guess one could always invoke a write operation instead of sync to
> get the more strict semantics, but as John suggests, it might be a
> good idea to add a new type of operation that requires followers to
> ack but doesn't require them to log to disk - this seems sufficient in
> our case.
>
> Alex
>
> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <[email protected]> wrote:
>> In theory, the scenario you're describing could happen, but I would argue
>> that it is unlikely given that: 1) a leader pings followers twice a tick to
>> make sure that it has a quorum of supporters (lead()); 2) followers give up
>> on a leader upon catching an exception (followLeader()). One could calibrate
>> tickTime to make the probability of having this scenario low.
>>
>> Let me also revisit the motivation for the way we designed sync. ZooKeeper
>> has been designed to serve reads efficiently, and making sync go through the
>> pipeline would slow down reads. Although optional, we thought it would be a
>> good idea to make it as efficient as possible to comply with the original
>> expectations for the service. We consequently came up with this cheap way of
>> making sure that a read sees all pending updates. It is correct that there
>> are some corner cases that it doesn't cover. One is the case you mentioned.
>> Another is having the sync finishing before the client submits the read and
>> having a write committing in between. We rely upon the way we implement
>> timeouts and some minimum degree of synchrony for the clients when
>> submitting operations to guarantee that the scheme works.
>>
>> We thought about the option of having the sync operation going through the
>> pipeline, and in fact it would have been easier to implement it just as a
>> regular write, but we opted not to because we felt it was sufficient for the
>> use cases we had and more efficient, as I already argued.
>>
>> Hope it helps to clarify.
>>
>> -Flavio
>>
>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
>>
>>> Thanks for the explanation! But how do you avoid having the scenario
>>> raised by John?
>>> Let's say you're a client connected to F, and F is connected to L. Let's
>>> also say that L's pipeline is now empty, and both F and L are partitioned
>>> from 3 other servers in the system that have already elected a new leader
>>> L'. Now I go to L' and write something. L still thinks it's the leader
>>> because the detection that followers left it is obviously timeout
>>> dependent. So when F sends your sync to L and L returns it to F, you
>>> actually miss my write!
>>>
>>> Alex
>>>
>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <[email protected]> wrote:
>>>> Hi Alex,
>>>>
>>>> Because of the following:
>>>>
>>>> 1- A follower F processes operations from a client in FIFO order, and say
>>>> that a client submits as you say sync + read;
>>>> 2- A sync will be processed by the leader and returned to the follower. It
>>>> will be queued after all pending updates that the follower hasn't
>>>> processed;
>>>> 3- The follower will process all pending updates before processing the
>>>> response of the sync;
>>>> 4- Once the follower processes the sync, it picks the read operation to
>>>> process. It reads the local state of the follower and returns to the
>>>> client.
>>>>
>>>> When we process the read in Step 4, we have applied all pending updates
>>>> the leader had for the follower by the time the read request started.
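[Editor's note: the FIFO argument in steps 1-4 can be sketched as a toy model.
This is a simulation for illustration only; the classes and method names below
are invented and are not ZooKeeper code.]

```python
from collections import deque

# Toy model of flush-based sync (steps 1-4 above). The leader keeps a FIFO
# queue of committed updates it still owes follower F; a sync is simply
# appended to that queue, so by the time F processes the sync response it
# has already applied every update that was pending when the sync started.

class Leader:
    def __init__(self):
        self.queue_to_f = deque()        # pending updates for follower F

    def commit(self, key, value):
        self.queue_to_f.append(("update", key, value))

    def sync(self):
        # No agreement round: the sync is just ordered after pending updates.
        self.queue_to_f.append(("sync",))

class Follower:
    def __init__(self):
        self.state = {}

    def process_until_sync(self, leader):
        # FIFO: apply every queued update until the sync marker is reached.
        while True:
            msg = leader.queue_to_f.popleft()
            if msg[0] == "sync":
                return
            _, key, value = msg
            self.state[key] = value

    def read(self, key):
        # Served from local state, after the sync response was processed.
        return self.state.get(key)

leader, follower = Leader(), Follower()
leader.commit("/x", 1)
leader.commit("/x", 2)              # committed, but not yet applied by F

leader.sync()                       # client submits sync + read via F
follower.process_until_sync(leader)
print(follower.read("/x"))          # prints 2: no pending update is missed
```

Note that the model only orders the read after updates known to this leader;
it says nothing about writes committed at a different leader, which is exactly
the partition scenario debated in this thread.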
>>>>
>>>> This implementation is a bit of a hack because it doesn't follow the same
>>>> code path as the other operations that go to the leader, but it avoids
>>>> some unnecessary steps, which is important for fast reads. In the sync
>>>> case, the other followers don't really need to know about it (there is
>>>> nothing to be updated) and the leader simply inserts it in the sequence of
>>>> updates of F, ordering it.
>>>>
>>>> -Flavio
>>>>
>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>>>>
>>>>> Hi Flavio,
>>>>>
>>>>>> Starting a read operation concurrently with a sync implies that the
>>>>>> result of the read will not miss an update committed before the read
>>>>>> started.
>>>>>
>>>>> I thought that the intention of sync was to give something like
>>>>> linearizable reads, so if you invoke a sync and then a read, your read
>>>>> is guaranteed to (at least) see any write which completed before the
>>>>> sync began. Is this the intention? If so, how is this achieved
>>>>> without running agreement on the sync op?
>>>>>
>>>>> Thanks,
>>>>> Alex
>>>>>
>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <[email protected]> wrote:
>>>>>> sync simply flushes the channel between the leader and the follower that
>>>>>> forwarded the sync operation, so it doesn't go through the full Zab
>>>>>> pipeline. Flushing means that all pending updates from the leader to the
>>>>>> follower are received by the time sync completes. Starting a read
>>>>>> operation concurrently with a sync implies that the result of the read
>>>>>> will not miss an update committed before the read started.
>>>>>>
>>>>>> -Flavio
>>>>>>
>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>>>>>
>>>>>>> It's strange that sync doesn't run through agreement; I was always
>>>>>>> assuming that it is...
>>>>>>> Exactly for the reason you say -
>>>>>>> you may trust your leader, but I may have a different leader and your
>>>>>>> leader may not detect it yet and still think it's the leader.
>>>>>>>
>>>>>>> This seems like a bug to me.
>>>>>>>
>>>>>>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or shouldn't)
>>>>>>> depend on timing assumptions.
>>>>>>> Only progress guarantees depend on time.
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <[email protected]> wrote:
>>>>>>>> I have some pretty strong requirements in terms of consistency, where
>>>>>>>> reading from followers that may be behind in terms of updates isn't ok
>>>>>>>> for my use case.
>>>>>>>>
>>>>>>>> One error case that worries me is if a follower and leader are
>>>>>>>> partitioned off from the network. A new leader is elected, but the
>>>>>>>> follower and old leader don't know about it.
>>>>>>>>
>>>>>>>> Normally I think sync was made for this purpose, but I looked at the
>>>>>>>> sync code and if there aren't any outstanding proposals the leader
>>>>>>>> sends the sync right back to the client without first verifying that
>>>>>>>> it still has quorum, so this won't work for my use case.
>>>>>>>>
>>>>>>>> At the core of the issue, all I really need is a call that will make
>>>>>>>> its way to the leader and will ping its followers, ensure it still
>>>>>>>> has a quorum and return success.
>>>>>>>>
>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be forwarded to
>>>>>>>> the leader; the leader will ensure it still has quorum and return its
>>>>>>>> epoch. I can use this primitive to implement all the other properties
>>>>>>>> I want to verify (assuming that my client will never connect to an
>>>>>>>> older epoch after this call returns).
>>>>>>>> Also, the nice thing about this method is that it will not have to
>>>>>>>> hit disk, and the latency should just be a round trip to the
>>>>>>>> followers.
>>>>>>>>
>>>>>>>> Most of the guarantees offered by ZooKeeper are time-based and rely on
>>>>>>>> clocks and expiring timers, but I'm hoping to offer some guarantees in
>>>>>>>> spite of busted clocks, horrible GC perf, VM suspends and any other
>>>>>>>> way time is broken.
>>>>>>>>
>>>>>>>> Also, if people are interested I can go into more detail about what I
>>>>>>>> am trying to write.
>>>>>>>>
>>>>>>>> -jc
>>>>>>
>>>>
>>
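[Editor's note: no such call exists in ZooKeeper; as a minimal sketch of the
primitive John proposes, the leader could answer only after confirming it
still has a quorum of acks for its epoch. Every name below is invented for
illustration.]

```python
# Sketch of the proposed getCurrentLeaderEpoch() primitive (hypothetical,
# not a ZooKeeper API). The leader pings its followers and returns its
# epoch only if a quorum (counting itself) still acknowledges that epoch.
# Nothing touches disk; the cost is one round trip to the followers.

class StaleLeaderError(Exception):
    pass

class Follower:
    def __init__(self, current_epoch):
        self.current_epoch = current_epoch

    def ping(self, epoch):
        # A follower acks only the epoch of the leader it currently follows.
        return epoch == self.current_epoch

class Leader:
    def __init__(self, epoch, followers, ensemble_size):
        self.epoch = epoch
        self.followers = followers            # follower stubs we can reach
        self.quorum = ensemble_size // 2 + 1  # majority quorum

    def get_current_leader_epoch(self):
        acks = 1                              # the leader counts itself
        for f in self.followers:
            if f.ping(self.epoch):
                acks += 1
        if acks < self.quorum:
            raise StaleLeaderError("lost quorum; a newer leader may exist")
        return self.epoch

# 5-server ensemble: a leader at epoch 3 with two reachable followers.
leader = Leader(epoch=3, followers=[Follower(3), Follower(3)], ensemble_size=5)
print(leader.get_current_leader_epoch())   # prints 3: quorum still acks epoch 3

# After a partition, the reachable followers have moved to epoch 4, so
# calling stale.get_current_leader_epoch() would raise StaleLeaderError
# instead of returning a stale answer.
stale = Leader(epoch=3, followers=[Follower(4), Follower(4)], ensemble_size=5)
```

This matches the property requested above: a successful return proves the
epoch was still current at some point after the call started, without any
dependence on clocks or timers.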
