Bumping this thread as it was brought up again by Hongchao in the context of failing tests. I agree with him that it's much better to make sync a quorum operation, even though this may come at the expense of slightly slower strong reads. I think this is acceptable: anyone who does "sync + read" cares more about semantics than latency.
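To make the trade-off concrete, here is a toy model of the failure mode this thread is about (all names invented, not ZooKeeper code): after a partition, the old leader's local-flush "sync" still succeeds even though a new leader may have committed more, while a quorum sync correctly fails.

```java
import java.util.List;

public class SyncModel {
    /** Old-style sync: the leader only flushes its own channel to the
     *  follower; it "succeeds" regardless of whether the leader can
     *  still reach a quorum. */
    static boolean localSync(List<Boolean> reachable) {
        return true;
    }

    /** Quorum sync: succeeds only if the (old) leader still gets acks
     *  from a majority of the ensemble after the sync arrived. */
    static boolean quorumSync(List<Boolean> reachable) {
        long acks = reachable.stream().filter(r -> r).count();
        return acks >= reachable.size() / 2 + 1;
    }
}
```

With a 5-node ensemble where the old leader can reach only itself and one follower, localSync would hand back a potentially stale view while quorumSync refuses.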
On Fri, Mar 1, 2013 at 7:06 PM, Alexander Shraer <[email protected]> wrote:

> This is an old thread (below), but it doesn't seem like any conclusion was reached on what we want to do to address the issue.
>
> Reminder of the problem: sync only gets you strong semantics if there is no leader change. If there is a leader change, then these semantics are guaranteed only if we make some timing assumptions, not made elsewhere in ZooKeeper. It would be much better not to make timing assumptions for such safety/consistency properties, only for liveness.
>
> The problem happens when your leader is no longer the leader but doesn't know it yet. It responds to a sync, but that doesn't mean your follower sees all committed state. Some other server may have already become the leader and committed some updates, which the sync won't flush to your follower, which is still connected to the old leader.
>
> To prevent this we should broadcast syncs like updates, or piggyback them on other ops, or perhaps create a new type of sync that is broadcast.
>
> As Ben pointed out, this problem is also mentioned in Section 4.4 of the ZooKeeper paper (but the proposed solution there is insufficient to solve the issue, as discussed below).
>
> Alex
>
> On Fri, Sep 28, 2012 at 4:45 PM, John Carrino <[email protected]> wrote:
>> Ben, after thinking about this more, I don't think this solution gets the property that I need. Just because there are outstanding proposals that are committed later doesn't imply we are still the leader. It only means that when the new leader does recovery it will also see these proposals as committed.
>>
>> Let's say we have a 5-node cluster and L1 has one pending request out. F2-F5 are followers. We get back an ack from F2. Now F5 and L1 are partitioned off from the network along with client C1.
>>
>> Recovery happens on F2-F4 and F2 becomes L2.
>> During recovery this proposal is accepted because F2 had acked it. Now L2 does a bunch of stuff including deleting your ephemeral node.
>>
>> Now a sync comes in from C1 through F5. Now L1 finally gets that ack from F5, goes ahead and commits it, and responds to the outstanding sync request to C1.
>>
>> We can see with this ordering there isn't a happens-after relationship between the sync request and knowing about all commits that occurred before the sync request.
>>
>> Yes, I realize that this ordering is unlikely to happen in practice, but I hate trusting time for anything.
>>
>> -jc
>>
>> On Fri, Sep 28, 2012 at 7:31 AM, John Carrino <[email protected]> wrote:
>>> This seems like a good compromise. We still have to eat the latency of a write, but we easily achieve smart batching in this case, so many outstanding syncs can all be serviced by the same lastPending request.
>>>
>>> -jc
>>>
>>> On Thu, Sep 27, 2012 at 11:17 PM, Benjamin Reed <[email protected]> wrote:
>>>> there is a very easy solution to this. we only rely on clocks in the case that there are no pending transactions. (if there are pending transactions, the sync will only return if in fact the leader is still the leader; otherwise the transaction that the sync is waiting on will never commit and the sync will never return.)
>>>>
>>>> so, if there aren't any transactions, just submit one. make it a bogus one: create / for example. then queue the sync behind it.
>>>>
>>>> ben
>>>>
>>>> ps - we bring up this issue and the solution and the rationale for the current implementation in section 4.4 of the zookeeper usenix paper.
>>>>
>>>> On Thu, Sep 27, 2012 at 9:57 AM, John Carrino <[email protected]> wrote:
>>>>> So I think it's time to explain what I'm writing just so everyone has more situation awareness.
>>>>> It's just a timestamp server, nothing fancy.
>>>>>
>>>>> Looks like this:
>>>>>
>>>>> public interface TimestampService {
>>>>>     /**
>>>>>      * This will get a fresh timestamp that is guaranteed to be newer
>>>>>      * than any other timestamp handed out before this method was called.
>>>>>      */
>>>>>     long getFreshTimestamp();
>>>>> }
>>>>>
>>>>> The only requirement is that the timestamp handed back is greater than every other timestamp that was returned before getFreshTimestamp was called. There is no ordering requirement for concurrent requests.
>>>>>
>>>>> My impl is to reserve blocks of timestamps that are safe to hand out (1M at a time) using compare-and-swap in ZK:
>>>>>
>>>>>     lastPossibleUsed = read(HighWater)
>>>>>     safeToHandout = compareAndSwap(lastPossibleUsed, lastPossibleUsed + 1M)
>>>>>
>>>>> Now my leader can hand back timestamps up to safeToHandout, but before it hands one out it must ensure it is still the leader (no one else has handed back something higher). I can use ensureQuorum(), exists(myEphemNode) to make sure this is the case. Now I have a service that is guaranteed to be correct, but doesn't require disk hits in the steady state, which brings down my latency (if you get close to running out, you can compareAndSwap for more timestamps).
>>>>>
>>>>> If many requests come in at the same time I can use smart batching to verify happens-after for all at once. We can also add more layers if we need more bandwidth to scale up, at the cost of adding latency. Basically our latency will be O(lg(requestRate)) if we keep adding layers as each previous layer becomes saturated.
>>>>>
>>>>> I hope this explanation helps.
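The block-reservation scheme just described could be sketched like this. The ZooKeeper compare-and-swap on the HighWater node is stood in for by an AtomicLong, and BLOCK_SIZE and all other names are illustrative, not from any real implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

public class TimestampBlockAllocator {
    static final long BLOCK_SIZE = 1_000_000; // "1M at a time"

    private final AtomicLong highWater;   // persisted in ZK in the real design
    private long next;                    // first unissued timestamp
    private long safeToHandOut;           // upper bound of the reserved block

    public TimestampBlockAllocator(AtomicLong highWater) {
        this.highWater = highWater;
    }

    /** Reserve a fresh block: read HighWater, CAS it forward by BLOCK_SIZE. */
    private void reserveBlock() {
        while (true) {
            long lastPossibleUsed = highWater.get();
            if (highWater.compareAndSet(lastPossibleUsed,
                                        lastPossibleUsed + BLOCK_SIZE)) {
                next = lastPossibleUsed + 1;
                safeToHandOut = lastPossibleUsed + BLOCK_SIZE;
                return;
            }
        }
    }

    /** Hand out timestamps from the reserved block; no ZK/disk hit in steady state. */
    public synchronized long getFreshTimestamp() {
        if (next == 0 || next > safeToHandOut) {
            reserveBlock(); // in the real service, re-verify leadership first
        }
        return next++;
    }
}
```

The point of the block is that the steady-state path never touches the store; only one CAS per million timestamps does.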
>>>>> I am busy for the next 4 hours, but if you need more clarification I can respond at that time.
>>>>>
>>>>> -jc
>>>>>
>>>>> On Thu, Sep 27, 2012 at 9:26 AM, John Carrino <[email protected]> wrote:
>>>>>> First, thanks everyone for talking this through with me.
>>>>>>
>>>>>> Flavio, for your example, this is actually ok. There is a happens-after relationship between the client making the request and my leader C1 still being the leader. My service only needs to guarantee that what it hands back is at least as new as anything that existed when the client made the request. If C2 were to answer requests while C1 is stalling, that is ok because these would be considered concurrent requests, and the stuff returned by C2 may be newer, but that doesn't violate any guarantees.
>>>>>>
>>>>>> If some client were to get back something from C2 and then (happens-after relationship) someone tried to read from C1, it needs to fail.
>>>>>>
>>>>>> To address your concern of adding too much bandwidth, we can get this easily by doing what Martin Thompson calls smart batching (http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html):
>>>>>>
>>>>>> 1. ensureQuorum request comes in to L1
>>>>>> 2. send ENSURE to all followers
>>>>>> 3. 10 more ensureQuorum requests come in
>>>>>> 4. get back ENSURE from quorum
>>>>>> 5. we can now service all 10 pending ensureQuorum requests with another round-trip ENSURE
>>>>>>
>>>>>> We don't need to send an ENSURE for every ensureQuorum request; we just need it to be happens-after from when the request arrived.
>>>>>>
>>>>>> I am fine with the ephemeral node being removed after some time expires, but only by the leader.
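The ENSURE batching steps above can be sketched as a small coalescer (invented names; a real implementation would broadcast ENSURE to followers inside startRound). At most one round is in flight, and a request is only answered by a round that started after it arrived:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class SmartBatcher {
    private final Queue<Runnable> waiting = new ArrayDeque<>(); // not yet in a round
    private final Queue<Runnable> inRound = new ArrayDeque<>(); // covered by in-flight round
    public int roundsSent = 0; // how many ENSURE broadcasts actually went out

    /** Client-facing ensureQuorum: queue the callback; start a round if idle. */
    public synchronized void ensureQuorum(Runnable onVerified) {
        waiting.add(onVerified);
        if (inRound.isEmpty()) startRound();
    }

    private void startRound() {
        inRound.addAll(waiting);
        waiting.clear();
        roundsSent++;
        // a real system would broadcast ENSURE to all followers here
    }

    /** Called when a quorum of followers has acked the in-flight ENSURE. */
    public synchronized void onQuorumAck() {
        // only requests queued before this round started are satisfied by it
        while (!inRound.isEmpty()) inRound.poll().run();
        if (!waiting.isEmpty()) startRound(); // one new round serves all latecomers
    }
}
```

However many requests pile up during a round, the next single round serves them all, which is the O(lg(requestRate))-style amortization John mentions.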
>>>>>> If the leader's clock is broken and the client owning the ephemeral node drops off, then we don't have liveness (because this node may not get cleaned up in a timely fashion). However, we still preserve correctness.
>>>>>>
>>>>>> -jc
>>>>>>
>>>>>> On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <[email protected]> wrote:
>>>>>>> Say that we implement what you're suggesting. Could you check if this scenario can happen:
>>>>>>>
>>>>>>> 1- Client C1 is the current leader and it issues a super boosted read to make sure it is still the leader;
>>>>>>> 2- We process the super boosted read, having it go through the zab pipeline;
>>>>>>> 3- When we send the response to C1 we slow down the whole deal: the response to C1 gets delayed and we stall C1;
>>>>>>> 4- In the meanwhile, C1's session expires on the server side and its ephemeral leadership node is removed;
>>>>>>> 5- A new client C2 is elected and starts exercising leadership;
>>>>>>> 6- Now C1 comes back to normal and receives the response of the super boosted read saying that it is still the leader.
>>>>>>>
>>>>>>> If my interpretation is correct, the only way to prevent this scenario from happening is if the session expires on the client side before it receives the response of the read. It doesn't look like we can do that if process clocks can be arbitrarily delayed.
>>>>>>>
>>>>>>> Note that one issue is that the behavior of ephemerals is highly dependent upon timers, so I don't think we can avoid making some timing assumptions altogether. The question is if we are better off with a mechanism relying upon acknowledgements.
>>>>>>> My sense is that application-level fencing is preferable (if not necessary) for applications like the ones JC is mentioning, or BookKeeper.
>>>>>>>
>>>>>>> I'm not concerned about writes to disk, which I agree we don't need for sync. I'm more concerned about having it go through the whole pipeline, which will induce more traffic to zab and increase latency for an application that uses it heavily.
>>>>>>>
>>>>>>> -Flavio
>>>>>>>
>>>>>>> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
>>>>>>>> another idea is to add this functionality to MultiOp - have read-only transactions be replicated but not logged, or logged asynchronously. I'm not sure how it works right now if I do a read-only MultiOp transaction - does it replicate the transaction or answer it locally on the leader?
>>>>>>>>
>>>>>>>> Alex
>>>>>>>>
>>>>>>>> On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <[email protected]> wrote:
>>>>>>>>> Thanks for the explanation.
>>>>>>>>>
>>>>>>>>> I guess one could always invoke a write operation instead of sync to get the more strict semantics, but as John suggests, it might be a good idea to add a new type of operation that requires followers to ack but doesn't require them to log to disk - this seems sufficient in our case.
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <[email protected]> wrote:
>>>>>>>>>> In theory, the scenario you're describing could happen, but I would argue that it is unlikely given that: 1) a leader pings followers twice a tick to make sure that it has a quorum of supporters (lead()); 2) followers give up on a leader upon catching an exception (followLeader()). One could calibrate tickTime to make the probability of having this scenario low.
>>>>>>>>>>
>>>>>>>>>> Let me also revisit the motivation for the way we designed sync. ZooKeeper has been designed to serve reads efficiently, and making sync go through the pipeline would slow down reads. Although optional, we thought it would be a good idea to make it as efficient as possible to comply with the original expectations for the service. We consequently came up with this cheap way of making sure that a read sees all pending updates. It is correct that there are some corner cases that it doesn't cover. One is the case you mentioned. Another is having the sync finish before the client submits the read and having a write commit in between. We rely upon the way we implement timeouts and some minimum degree of synchrony for the clients when submitting operations to guarantee that the scheme works.
>>>>>>>>>> We thought about the option of having the sync operation go through the pipeline, and in fact it would have been easier to implement it just as a regular write, but we opted not to because we felt it was sufficient for the use cases we had and more efficient, as I already argued.
>>>>>>>>>>
>>>>>>>>>> Hope it helps to clarify.
>>>>>>>>>>
>>>>>>>>>> -Flavio
>>>>>>>>>>
>>>>>>>>>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
>>>>>>>>>>> thanks for the explanation! but how do you avoid having the scenario raised by John? let's say you're a client connected to F, and F is connected to L. Let's also say that L's pipeline is now empty, and both F and L are partitioned from 3 other servers in the system that have already elected a new leader L'. Now I go to L' and write something. L still thinks it's the leader because the detection that followers left it is obviously timeout dependent. So when F sends your sync to L and L returns it to F, you actually miss my write!
>>>>>>>>>>>
>>>>>>>>>>> Alex
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <[email protected]> wrote:
>>>>>>>>>>>> Hi Alex, Because of the following:
>>>>>>>>>>>>
>>>>>>>>>>>> 1- A follower F processes operations from a client in FIFO order, and say that a client submits, as you say, sync + read;
>>>>>>>>>>>> 2- A sync will be processed by the leader and returned to the follower.
>>>>>>>>>>>> It will be queued after all pending updates that the follower hasn't processed;
>>>>>>>>>>>> 3- The follower will process all pending updates before processing the response of the sync;
>>>>>>>>>>>> 4- Once the follower processes the sync, it picks the read operation to process. It reads the local state of the follower and returns to the client.
>>>>>>>>>>>>
>>>>>>>>>>>> When we process the read in Step 4, we have applied all pending updates the leader had for the follower by the time the read request started.
>>>>>>>>>>>>
>>>>>>>>>>>> This implementation is a bit of a hack because it doesn't follow the same code path as the other operations that go to the leader, but it avoids some unnecessary steps, which is important for fast reads. In the sync case, the other followers don't really need to know about it (there is nothing to be updated) and the leader simply inserts it into the sequence of updates of F, ordering it.
>>>>>>>>>>>>
>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>>>>>>>>>>>>> Hi Flavio,
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I thought that the intention of sync was to give something like linearizable reads, so if you invoke a sync and then a read, your read is guaranteed to (at least) see any write which completed before the sync began. Is this the intention? If so, how is this achieved without running agreement on the sync op?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <[email protected]> wrote:
>>>>>>>>>>>>>> sync simply flushes the channel between the leader and the follower that forwarded the sync operation, so it doesn't go through the full zab pipeline. Flushing means that all pending updates from the leader to the follower are received by the time sync completes. Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>>>>>>>>>>>>>> It's strange that sync doesn't run through agreement; I was always assuming that it was... Exactly for the reason you say - you may trust your leader, but I may have a different leader, and your leader may not detect it yet and still think it's the leader.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This seems like a bug to me.
>>>>>>>>>>>>>>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or shouldn't) depend on timing assumptions. Only progress guarantees depend on time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <[email protected]> wrote:
>>>>>>>>>>>>>>>> I have some pretty strong requirements in terms of consistency, where reading from followers that may be behind in terms of updates isn't ok for my use case.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> One error case that worries me is if a follower and leader are partitioned off from the network. A new leader is elected, but the follower and old leader don't know about it.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Normally I think sync was made for this purpose, but I looked at the sync code, and if there aren't any outstanding proposals the leader sends the sync right back to the client without first verifying that it still has quorum, so this won't work for my use case.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> At the core of the issue, all I really need is a call that will make its way to the leader, ping its followers, ensure it still has a quorum, and return success.
>>>>>>>>>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be forwarded to the leader; the leader will ensure it still has quorum and return its epoch. I can use this primitive to implement all the other properties I want to verify (assuming that my client will never connect to an older epoch after this call returns). Also, the nice thing about this method is that it will not have to hit disk, and the latency should just be a round trip to the followers.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Most of the guarantees offered by ZooKeeper are time based and rely on clocks and expiring timers, but I'm hoping to offer some guarantees in spite of busted clocks, horrible GC perf, VM suspends, and any other way time is broken.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, if people are interested, I can go into more detail about what I am trying to write.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -jc
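For reference, the getCurrentLeaderEpoch() primitive John describes at the bottom of the thread could be sketched as follows (all names hypothetical; the quorum arithmetic assumes the leader counts as its own supporter, and stillFollows stands in for a fresh round-trip ping sent after the request arrived):

```java
import java.util.List;
import java.util.function.Predicate;

public class LeaderEpochCheck {
    /**
     * Returns the leader's epoch only after confirming, post-request,
     * that a quorum of the ensemble still follows it; returns -1 if the
     * leader can no longer assemble a quorum (i.e. it may be deposed).
     */
    public static long getCurrentLeaderEpoch(long epoch, List<String> followers,
                                             Predicate<String> stillFollows) {
        int supporters = 1; // the leader itself
        for (String f : followers) {
            if (stillFollows.test(f)) supporters++;
        }
        int quorum = (followers.size() + 1) / 2 + 1;
        return supporters >= quorum ? epoch : -1;
    }
}
```

Because the check only involves a round trip of pings, it never touches disk, which matches the latency goal stated above.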
