This seems like a good compromise. We still have to eat the latency of a write, but we can easily achieve smart batching in this case, so many outstanding syncs can all be serviced by the same lastPending request.
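As a rough sketch of that batching idea (hypothetical names, not ZooKeeper API): one in-flight quorum round trip services every caller that was already waiting when it started, and later arrivals share the next round trip.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Sketch of smart batching for quorum checks. QuorumBatcher and
// quorumAcked() are illustrative names, not ZooKeeper code. Every
// caller that arrives before an ENSURE round trip starts is completed
// by that round trip; later arrivals are batched into the next one.
class QuorumBatcher {
    private List<CompletableFuture<Void>> waiting = new ArrayList<>();
    private List<CompletableFuture<Void>> inFlight = null;
    int roundTrips = 0; // how many ENSURE round trips were actually sent

    // Caller asks: "does the leader still have quorum support?"
    synchronized CompletableFuture<Void> ensureQuorum() {
        CompletableFuture<Void> f = new CompletableFuture<>();
        waiting.add(f);
        if (inFlight == null) {
            sendEnsure(); // no round trip in flight, start one now
        }
        return f;
    }

    private void sendEnsure() {
        inFlight = waiting;          // covered by this round trip
        waiting = new ArrayList<>(); // later arrivals wait for the next
        roundTrips++;                // real code: send ENSURE to followers
    }

    // Invoked when a quorum of followers has acked the in-flight ENSURE.
    synchronized void quorumAcked() {
        for (CompletableFuture<Void> f : inFlight) {
            f.complete(null);        // happens-after is established
        }
        inFlight = null;
        if (!waiting.isEmpty()) {
            sendEnsure();            // one round trip serves all stragglers
        }
    }
}
```

Note how ten requests arriving while one ENSURE is in flight cost only one additional round trip, which is the effect described in the thread.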
-jc

On Thu, Sep 27, 2012 at 11:17 PM, Benjamin Reed <[email protected]> wrote:
> there is a very easy solution to this. we only rely on clocks in the
> case that there are no pending transactions. (if there are pending
> transactions, the sync will only return if in fact the leader is still
> the leader, otherwise the transaction that the sync is waiting on will
> never commit and the sync will never return.)
>
> so, if there aren't any transactions, just submit one. make it a bogus
> one: create / for example. then queue the sync behind it.
>
> ben
>
> ps - we bring up this issue, the solution, and the rationale for the
> current implementation in section 4.4 of the zookeeper usenix paper.
>
> On Thu, Sep 27, 2012 at 9:57 AM, John Carrino <[email protected]> wrote:
> > So I think it's time to explain what I'm writing just so everyone has
> > more situational awareness. It's just a timestamp server, nothing fancy.
> >
> > Looks like this:
> >
> > public interface TimestampService {
> >     /**
> >      * This will get a fresh timestamp that is guaranteed to be newer
> >      * than any other timestamp handed out before this method was called.
> >      */
> >     long getFreshTimestamp();
> > }
> >
> > The only requirement is that the timestamp handed back is greater than
> > every other timestamp that was returned before getFreshTimestamp was
> > called. There is no ordering requirement for concurrent requests.
> >
> > My impl is to reserve blocks of timestamps that are safe to hand out
> > (1M at a time) using compare and swap in ZK:
> >
> > lastPossibleUsed = read(HighWater)
> > safeToHandout = compareAndSwap(lastPossibleUsed, lastPossibleUsed + 1M)
> >
> > Now my leader can hand back timestamps up to safeToHandout, but before
> > it hands one out it must ensure it is still the leader (no one else has
> > handed back something higher). I can use ensureQuorum(),
> > exists(myEphemNode) to make sure this is the case.
> > Now I have a service that is guaranteed to be correct, but doesn't
> > require disk hits in the steady state, which brings down my latency
> > (if you get close to running out, you can compareAndSwap for more
> > timestamps).
> >
> > If many requests come in at the same time I can use smart batching to
> > verify happens-after for all of them at once. We can also add more
> > layers if we need more bandwidth to scale up, at the cost of adding
> > latency. Basically our latency will be O(lg(requestRate)) if we keep
> > adding layers as each previous layer becomes saturated.
> >
> > I hope this explanation helps. I am busy for the next 4 hours, but if
> > you need more clarification I can respond at that time.
> >
> > -jc
> >
> > On Thu, Sep 27, 2012 at 9:26 AM, John Carrino <[email protected]> wrote:
> >> First, thanks everyone for talking this through with me.
> >>
> >> Flavio, for your example, this is actually ok. There is a
> >> happens-after relationship between the client making the request and
> >> my leader C1 still being the leader. My service only needs to
> >> guarantee that what it hands back is at least as new as anything that
> >> existed when the client made the request. If C2 were to answer
> >> requests while C1 is stalling, that is ok, because these would be
> >> considered concurrent requests, and the stuff returned by C2 may be
> >> newer, but that doesn't violate any guarantees.
> >>
> >> If some client were to get back something from C2 and then
> >> (happens-after relationship) someone tried to read from C1, it needs
> >> to fail.
> >>
> >> To address your concern of adding too much bandwidth, we can get this
> >> easily by doing what Martin Thompson calls smart batching
> >> (http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html):
> >>
> >> 1. ensureQuorum request comes in to L1
> >> 2. send ENSURE to all followers
> >> 3. 10 more ensureQuorum requests come in
> >> 4. get back ENSURE from quorum
> >> 5. we can now service all 10 pending ensureQuorum requests with
> >>    another round-trip ENSURE
> >>
> >> We don't need to send an ENSURE for every ensureQuorum request, we
> >> just need it to be happens-after from when the request arrived.
> >>
> >> I am fine with the ephemeral node being removed after some time
> >> expires, but only by the leader. If the leader's clock is broken and
> >> the client owning the ephemeral node drops off, then we don't have
> >> liveness (because this node may not get cleaned up in a timely
> >> fashion). However, we still preserve correctness.
> >>
> >> -jc
> >>
> >> On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <[email protected]> wrote:
> >>> Say that we implement what you're suggesting. Could you check if
> >>> this scenario can happen:
> >>>
> >>> 1- Client C1 is the current leader and it issues a super boosted
> >>> read to make sure it is still the leader;
> >>> 2- We process the super boosted read, having it go through the zab
> >>> pipeline;
> >>> 3- When we send the response to C1 we slow down the whole deal: the
> >>> response to C1 gets delayed and we stall C1;
> >>> 4- In the meanwhile, C1's session expires on the server side and its
> >>> ephemeral leadership node is removed;
> >>> 5- A new client C2 is elected and starts exercising leadership;
> >>> 6- Now C1 comes back to normal and receives the response of the
> >>> super boosted read saying that it is still the leader.
> >>>
> >>> If my interpretation is correct, the only way to prevent this
> >>> scenario from happening is if the session expires on the client side
> >>> before it receives the response of the read. It doesn't look like we
> >>> can do it if process clocks can be arbitrarily delayed.
> >>>
> >>> Note that one issue is that the behavior of ephemerals is highly
> >>> dependent upon timers, so I don't think we can avoid making some
> >>> timing assumptions altogether.
> >>> The question is whether we are better off with a mechanism relying
> >>> upon acknowledgements. My sense is that application-level fencing is
> >>> preferable (if not necessary) for applications like the ones JC is
> >>> mentioning or BookKeeper.
> >>>
> >>> I'm not concerned about writes to disk, which I agree we don't need
> >>> for sync. I'm more concerned about having it go through the whole
> >>> pipeline, which will induce more traffic to zab and increase latency
> >>> for an application that uses it heavily.
> >>>
> >>> -Flavio
> >>>
> >>> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
> >>>
> >>>> another idea is to add this functionality to MultiOp - have
> >>>> read-only transactions be replicated but not logged, or logged
> >>>> asynchronously. I'm not sure how it works right now if I do a
> >>>> read-only MultiOp transaction - does it replicate the transaction
> >>>> or answer it locally on the leader?
> >>>>
> >>>> Alex
> >>>>
> >>>> On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <[email protected]> wrote:
> >>>>> Thanks for the explanation.
> >>>>>
> >>>>> I guess one could always invoke a write operation instead of sync
> >>>>> to get the more strict semantics, but as John suggests, it might
> >>>>> be a good idea to add a new type of operation that requires
> >>>>> followers to ack but doesn't require them to log to disk - this
> >>>>> seems sufficient in our case.
> >>>>>
> >>>>> Alex
> >>>>>
> >>>>> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <[email protected]> wrote:
> >>>>>> In theory, the scenario you're describing could happen, but I
> >>>>>> would argue that it is unlikely given that: 1) a leader pings
> >>>>>> followers twice a tick to make sure that it has a quorum of
> >>>>>> supporters (lead()); 2) followers give up on a leader upon
> >>>>>> catching an exception (followLeader()). One could calibrate
> >>>>>> tickTime to make the probability of having this scenario low.
> >>>>>> Let me also revisit the motivation for the way we designed sync.
> >>>>>> ZooKeeper has been designed to serve reads efficiently, and
> >>>>>> making sync go through the pipeline would slow down reads.
> >>>>>> Although optional, we thought it would be a good idea to make it
> >>>>>> as efficient as possible to comply with the original expectations
> >>>>>> for the service. We consequently came up with this cheap way of
> >>>>>> making sure that a read sees all pending updates. It is correct
> >>>>>> that there are some corner cases that it doesn't cover. One is
> >>>>>> the case you mentioned. Another is having the sync finish before
> >>>>>> the client submits the read and having a write commit in between.
> >>>>>> We rely upon the way we implement timeouts and some minimum
> >>>>>> degree of synchrony for the clients when submitting operations to
> >>>>>> guarantee that the scheme works.
> >>>>>>
> >>>>>> We thought about the option of having the sync operation go
> >>>>>> through the pipeline, and in fact it would have been easier to
> >>>>>> implement it just as a regular write, but we opted not to because
> >>>>>> we felt it was sufficient for the use cases we had and more
> >>>>>> efficient, as I already argued.
> >>>>>>
> >>>>>> Hope it helps to clarify.
> >>>>>>
> >>>>>> -Flavio
> >>>>>>
> >>>>>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
> >>>>>>
> >>>>>>> thanks for the explanation! but how do you avoid having the
> >>>>>>> scenario raised by John? Let's say you're a client connected to
> >>>>>>> F, and F is connected to L. Let's also say that L's pipeline is
> >>>>>>> now empty, and both F and L are partitioned from 3 other servers
> >>>>>>> in the system that have already elected a new leader L'. Now I
> >>>>>>> go to L' and write something.
> >>>>>>> L still thinks it's the leader, because the detection that
> >>>>>>> followers left it is obviously timeout dependent. So when F
> >>>>>>> sends your sync to L and L returns it to F, you actually miss
> >>>>>>> my write!
> >>>>>>>
> >>>>>>> Alex
> >>>>>>>
> >>>>>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <[email protected]> wrote:
> >>>>>>>> Hi Alex, Because of the following:
> >>>>>>>>
> >>>>>>>> 1- A follower F processes operations from a client in FIFO
> >>>>>>>> order, and say that a client submits, as you say, sync + read;
> >>>>>>>> 2- A sync will be processed by the leader and returned to the
> >>>>>>>> follower. It will be queued after all pending updates that the
> >>>>>>>> follower hasn't processed;
> >>>>>>>> 3- The follower will process all pending updates before
> >>>>>>>> processing the response of the sync;
> >>>>>>>> 4- Once the follower processes the sync, it picks the read
> >>>>>>>> operation to process. It reads the local state of the follower
> >>>>>>>> and returns to the client.
> >>>>>>>>
> >>>>>>>> When we process the read in Step 4, we have applied all pending
> >>>>>>>> updates the leader had for the follower by the time the read
> >>>>>>>> request started.
> >>>>>>>>
> >>>>>>>> This implementation is a bit of a hack because it doesn't
> >>>>>>>> follow the same code path as the other operations that go to
> >>>>>>>> the leader, but it avoids some unnecessary steps, which is
> >>>>>>>> important for fast reads. In the sync case, the other followers
> >>>>>>>> don't really need to know about it (there is nothing to be
> >>>>>>>> updated) and the leader simply inserts it in the sequence of
> >>>>>>>> updates of F, ordering it.
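The FIFO argument in those four steps can be modeled with a toy sketch; the classes below are purely illustrative and are not ZooKeeper's actual code. The leader inserts the sync reply into the follower's update stream behind all pending updates, so by the time the follower answers the sync, every update the leader had pending has been applied locally.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model of the sync "channel flush": updates and the sync reply
// travel the same FIFO channel from leader to follower, so the sync
// reply cannot overtake any update that was pending when it was queued.
class FollowerModel {
    private final Map<String, String> localState = new HashMap<>();
    private final Queue<String[]> fromLeader = new ArrayDeque<>(); // FIFO

    void leaderSendsUpdate(String path, String value) {
        fromLeader.add(new String[] {"UPDATE", path, value});
    }

    void leaderSendsSyncReply() {
        fromLeader.add(new String[] {"SYNC"}); // queued AFTER pending updates
    }

    // Drain the channel in order until the sync reply is reached,
    // applying every pending update along the way.
    void processUntilSync() {
        while (true) {
            String[] msg = fromLeader.remove();
            if ("SYNC".equals(msg[0])) {
                return;                      // sync complete; read may proceed
            }
            localState.put(msg[1], msg[2]);  // apply a pending update
        }
    }

    // The read that was queued behind the sync sees the flushed state.
    String read(String path) {
        return localState.get(path);
    }
}
```

The model also shows what it does not guarantee: nothing here consults the other followers, which is exactly the stale-leader gap Alex and John are pointing at.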
> >>> >>>>> > >>> >>>>> -Flavio > >>> >>>>> > >>> >>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote: > >>> >>>>> > >>> >>>>>> Hi Flavio, > >>> >>>>>> > >>> >>>>>>> Starting a read operation concurrently with a sync implies that > >>> >>>>>>> the result of the read will not miss an update committed > before the read > >>> >>>>>>> started. > >>> >>>>>> > >>> >>>>>> I thought that the intention of sync was to give something like > >>> >>>>>> linearizable reads, so if you invoke a sync and then a read, > your > >>> >>>>>> read > >>> >>>>>> is guaranteed to (at least) see any write which completed before > >>> >>>>>> the > >>> >>>>>> sync began. Is this the intention ? If so, how is this achieved > >>> >>>>>> without running agreement on the sync op ? > >>> >>>>>> > >>> >>>>>> Thanks, > >>> >>>>>> Alex > >>> >>>>>> > >>> >>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira > >>> >>>>>> <[email protected]> wrote: > >>> >>>>>>> sync simply flushes the channel between the leader and the > >>> >>>>>>> follower that forwarded the sync operation, so it doesn't go > through the > >>> >>>>>>> full zab pipeline. Flushing means that all pending updates > from the leader > >>> >>>>>>> to the follower are received by the time sync completes. > Starting a read > >>> >>>>>>> operation concurrently with a sync implies that the result of > the read will > >>> >>>>>>> not miss an update committed before the read started. > >>> >>>>>>> > >>> >>>>>>> -Flavio > >>> >>>>>>> > >>> >>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote: > >>> >>>>>>> > >>> >>>>>>>> Its strange that sync doesn't run through agreement, I was > >>> >>>>>>>> always > >>> >>>>>>>> assuming that it is... Exactly for the reason you say - > >>> >>>>>>>> you may trust your leader, but I may have a different leader > and > >>> >>>>>>>> your > >>> >>>>>>>> leader may not detect it yet and still think its the leader. > >>> >>>>>>>> > >>> >>>>>>>> This seems like a bug to me. 
> >>> >>>>>>>> > >>> >>>>>>>> Similarly to Paxos, Zookeeper's safety guarantees don't (or > >>> >>>>>>>> shouldn't) > >>> >>>>>>>> depend on timing assumption. > >>> >>>>>>>> Only progress guarantees depend on time. > >>> >>>>>>>> > >>> >>>>>>>> Alex > >>> >>>>>>>> > >>> >>>>>>>> > >>> >>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino > >>> >>>>>>>> <[email protected]> wrote: > >>> >>>>>>>>> I have some pretty strong requirements in terms of > consistency > >>> >>>>>>>>> where > >>> >>>>>>>>> reading from followers that may be behind in terms of updates > >>> >>>>>>>>> isn't ok for > >>> >>>>>>>>> my use case. > >>> >>>>>>>>> > >>> >>>>>>>>> One error case that worries me is if a follower and leader > are > >>> >>>>>>>>> partitioned > >>> >>>>>>>>> off from the network. A new leader is elected, but the > >>> >>>>>>>>> follower and old > >>> >>>>>>>>> leader don't know about it. > >>> >>>>>>>>> > >>> >>>>>>>>> Normally I think sync was made for this purpost, but I looked > >>> >>>>>>>>> at the sync > >>> >>>>>>>>> code and if there aren't any outstanding proposals the leader > >>> >>>>>>>>> sends the > >>> >>>>>>>>> sync right back to the client without first verifying that it > >>> >>>>>>>>> still has > >>> >>>>>>>>> quorum, so this won't work for my use case. > >>> >>>>>>>>> > >>> >>>>>>>>> At the core of the issue all I really need is a call that > will > >>> >>>>>>>>> make it's > >>> >>>>>>>>> way to the leader and will ping it's followers, ensure it > still > >>> >>>>>>>>> has a > >>> >>>>>>>>> quorum and return success. > >>> >>>>>>>>> > >>> >>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be > >>> >>>>>>>>> forwarded to the > >>> >>>>>>>>> leader, leader will ensure it still has quorum and return > it's > >>> >>>>>>>>> epoch. 
> >>>>>>>>>>>> I can use this primitive to implement all the other
> >>>>>>>>>>>> properties I want to verify (assuming that my client will
> >>>>>>>>>>>> never connect to an older epoch after this call returns).
> >>>>>>>>>>>> Also the nice thing about this method is that it will not
> >>>>>>>>>>>> have to hit disk, and the latency should just be a round
> >>>>>>>>>>>> trip to the followers.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Most of the guarantees offered by zookeeper are time based
> >>>>>>>>>>>> and rely on clocks and expiring timers, but I'm hoping to
> >>>>>>>>>>>> offer some guarantees in spite of busted clocks, horrible
> >>>>>>>>>>>> GC perf, VM suspends, and any other way time is broken.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Also if people are interested I can go into more detail
> >>>>>>>>>>>> about what I am trying to write.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -jc
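For reference, John's block-reservation scheme from earlier in the thread might look roughly like this as an in-process sketch. The AtomicLong stands in for the HighWater znode that a real implementation would advance with a ZooKeeper version-checked setData; BLOCK_SIZE and the class name are illustrative, and the leadership check is only a comment.

```java
import java.util.concurrent.atomic.AtomicLong;

// The interface from John's mail.
interface TimestampService {
    long getFreshTimestamp();
}

// Hedged sketch of reserving blocks of 1M timestamps at a time so the
// steady-state path never touches ZK or disk. Not a real implementation:
// the AtomicLong is a stand-in for the compare-and-swap on the HighWater
// znode described in the thread.
class BlockReservingTimestampService implements TimestampService {
    private static final long BLOCK_SIZE = 1_000_000;
    private final AtomicLong highWater = new AtomicLong(0); // "znode" stand-in
    private long next = 0;          // next timestamp to hand out
    private long safeToHandOut = 0; // upper bound of the reserved block

    @Override
    public synchronized long getFreshTimestamp() {
        if (next >= safeToHandOut) {
            // Rare path: reserve another block via compare-and-swap.
            long last = highWater.get();
            if (!highWater.compareAndSet(last, last + BLOCK_SIZE)) {
                // Someone else advanced HighWater: we may not be leader.
                throw new IllegalStateException("lost reservation race");
            }
            next = last;
            safeToHandOut = last + BLOCK_SIZE;
        }
        // A real leader would also run the ensureQuorum() /
        // exists(myEphemNode) check from the thread before handing
        // this timestamp out.
        return next++;
    }
}
```

The common case is a pure in-memory increment, which is why the only remaining latency question in the thread is how cheaply the leader can confirm it still has quorum.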
