Yes. This property needs to be global because otherwise another transaction can start and commit in the past when it shouldn't be allowed to.
-sent from my phone
On Sep 28, 2012 8:13 AM, "Flavio Junqueira" <[email protected]> wrote:

> I checked your message again, and the contract is that getFreshTimestamp()
> always gets a fresh timestamp (not older than any returned by that method).
> One aspect of the contract that is not clear to me: does this freshness
> property hold across clients or only for a single client? If it holds only
> per client, then you can guarantee it with master epochs. No user of your
> application will connect to the master of epoch y < x once it connects to
> the master of epoch x. The epochs can be the sequence number of a
> sequential znode. How does that sound?
>
> -Flavio
>
> On Sep 28, 2012, at 5:00 PM, Flavio Junqueira wrote:
>
> > I was thinking that you can do a write per timestamp batch, and not per
> > individual timestamp. In the worst case, a former leader won't use some
> > timestamps, and I would think that is ok, but it depends on your
> > application.
> >
> > Also, even if two clients believe they are leaders simultaneously and
> > serve timestamps, the property that you seem to care about the most is
> > uniqueness: a timestamp is not served twice. Order of timestamps would be
> > preserved per leader, but in the case of overlapping leaders you could end
> > up serving timestamps that do not follow an increasing order.
> >
> > -Flavio
> >
> > On Sep 28, 2012, at 4:37 PM, John Carrino wrote:
> >
> >> CompareAndSwap is an atomic check and update: it only updates the
> >> value if it is the same as the expected value.
> >>
> >> I think with your approach you'd have to do a write for every single
> >> timestamp you wanted to hand out. The latency hit on this would be too
> >> much.
> >>
> >> My approach is different in that a timestamp server reserves a bunch of
> >> timestamps up front and proceeds to hand them out as long as it is the
> >> leader. The leader check can hopefully be done without hitting disk.
> >>
> >> Thanks!
> >>
> >> -jc
> >>
> >>
> >> On Fri, Sep 28, 2012 at 7:19 AM, Flavio Junqueira <[email protected]> wrote:
> >>
> >>> I don't know what your compareAndSwap method does, but I was wondering if
> >>> your client process can use conditional writes to a znode to make sure
> >>> that it was the last one to update the state of timestamp batches. You
> >>> can treat the master election problem separately, and it does not have to
> >>> be as strict as you have been thinking you need. That is, it wouldn't
> >>> hurt if a client still thinks it is leading even if it is not, because no
> >>> two clients will be able to update the state of timestamp blocks without
> >>> noticing that another client is also updating it.
> >>>
> >>> -Flavio
> >>>
> >>> On Sep 27, 2012, at 6:57 PM, John Carrino wrote:
> >>>
> >>>> So I think it's time to explain what I'm writing just so everyone has
> >>>> more situational awareness. It's just a timestamp server, nothing fancy.
> >>>>
> >>>> It looks like this:
> >>>>
> >>>> public interface TimestampService {
> >>>>     /**
> >>>>      * This will get a fresh timestamp that is guaranteed to be newer
> >>>>      * than any other timestamp handed out before this method was called.
> >>>>      */
> >>>>     long getFreshTimestamp();
> >>>> }
> >>>>
> >>>> The only requirement is that the timestamp handed back is greater than
> >>>> every other timestamp that was returned before getFreshTimestamp was
> >>>> called. There is no ordering requirement for concurrent requests.
> >>>>
> >>>> My impl is to reserve blocks of timestamps that are safe to hand out (1M
> >>>> at a time) using compare and swap in ZK:
> >>>>
> >>>> lastPossibleUsed = read(HighWater)
> >>>> safeToHandout = compareAndSwap(lastPossibleUsed, lastPossibleUsed + 1M)
> >>>>
> >>>> Now my leader can hand back timestamps up to safeToHandout, but before
> >>>> it hands one out it must ensure it is still the leader (no one else has
> >>>> handed back something higher).
> >>>> I can use ensureQuorum() and exists(myEphemNode) to make sure this is
> >>>> the case. Now I have a service that is guaranteed to be correct but
> >>>> doesn't require disk hits in the steady state, which brings down my
> >>>> latency (if you get close to running out, you can compareAndSwap for
> >>>> more timestamps).
> >>>>
> >>>> If many requests come in at the same time I can use smart batching to
> >>>> verify happens-after for all of them at once. We can also add more
> >>>> layers if we need more bandwidth to scale up, at the cost of adding
> >>>> latency. Basically our latency will be O(lg(requestRate)) if we keep
> >>>> adding layers as each previous layer becomes saturated.
> >>>>
> >>>> I hope this explanation helps. I am busy for the next 4 hours, but if
> >>>> you need more clarification I can respond then.
> >>>>
> >>>> -jc
> >>>>
> >>>>
> >>>> On Thu, Sep 27, 2012 at 9:26 AM, John Carrino <[email protected]> wrote:
> >>>>
> >>>>> First, thanks everyone for talking this through with me.
> >>>>>
> >>>>> Flavio, for your example, this is actually ok. There is a
> >>>>> happens-after relationship between the client making the request and my
> >>>>> leader C1 still being the leader. My service only needs to guarantee
> >>>>> that what it hands back is at least as new as anything that existed
> >>>>> when the client made the request. If C2 were to answer requests while
> >>>>> C1 is stalling, that is ok, because these would be considered
> >>>>> concurrent requests, and the stuff returned by C2 may be newer, but
> >>>>> that doesn't violate any guarantees.
> >>>>>
> >>>>> If some client were to get back something from C2 and then (in a
> >>>>> happens-after relationship) someone tried to read from C1, it needs to
> >>>>> fail.
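The block-reservation scheme John describes above (read the high-water mark, compare-and-swap it forward by 1M, then hand out timestamps from the reserved range) can be sketched in plain Java. This is a minimal illustration, not code from the thread: the HighWater znode and its conditional write are simulated with an `AtomicLong`, the names `TimestampAllocator` and `BLOCK_SIZE` are made up, and the per-timestamp leadership check (`ensureQuorum()` / `exists(myEphemNode)`) is deliberately omitted.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the block-reservation scheme from the thread. The ZooKeeper
// znode holding the high-water mark is simulated by an AtomicLong; in the
// real service the compareAndSwap would be a conditional setData on the
// znode using its version. The leadership check before each handout is
// omitted here.
class TimestampAllocator {
    static final long BLOCK_SIZE = 1_000_000L;

    private final AtomicLong highWater; // stands in for the HighWater znode
    private long next = 1;              // next timestamp to hand out
    private long safeToHandOut = 0;     // upper bound of the reserved block

    TimestampAllocator(AtomicLong sharedHighWater) {
        this.highWater = sharedHighWater;
    }

    /** Reserve a fresh block of BLOCK_SIZE timestamps via compare-and-swap. */
    private void reserveBlock() {
        while (true) {
            long lastPossibleUsed = highWater.get();     // read(HighWater)
            long newLimit = lastPossibleUsed + BLOCK_SIZE;
            // compareAndSwap(lastPossibleUsed, lastPossibleUsed + 1M)
            if (highWater.compareAndSet(lastPossibleUsed, newLimit)) {
                next = lastPossibleUsed + 1;
                safeToHandOut = newLimit;
                return;
            }
            // Another allocator won the race; retry against the new value.
        }
    }

    /** Hand out a timestamp; no ZK/disk round trip while the block lasts. */
    long getFreshTimestamp() {
        if (next > safeToHandOut) {
            reserveBlock();
        }
        return next++;
    }
}
```

Two allocators sharing the same high-water mark reserve disjoint blocks, which gives the uniqueness property Flavio points out, while ordering across overlapping leaders is not guaranteed.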
> >>>>>
> >>>>> To address your concern of adding too much bandwidth, we can get this
> >>>>> easily by doing what Martin Thompson calls smart batching (
> >>>>> http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html ).
> >>>>>
> >>>>> 1. An ensureQuorum request comes in to L1.
> >>>>> 2. Send ENSURE to all followers.
> >>>>> 3. 10 more ensureQuorum requests come in.
> >>>>> 4. Get back ENSURE from a quorum.
> >>>>> 5. We can now service all 10 pending ensureQuorum requests with
> >>>>>    another round-trip ENSURE.
> >>>>>
> >>>>> We don't need to send an ENSURE for every ensureQuorum request; we
> >>>>> just need it to be happens-after from when the request arrived.
> >>>>>
> >>>>> I am fine with the ephemeral node being removed after some time
> >>>>> expires, but only by the leader. If the leader's clock is broken and
> >>>>> the client owning the ephemeral node drops off, then we don't have
> >>>>> liveness (because this node may not get cleaned up in a timely
> >>>>> fashion). However, we still preserve correctness.
> >>>>>
> >>>>> -jc
> >>>>>
> >>>>>
> >>>>> On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <[email protected]> wrote:
> >>>>>
> >>>>>> Say that we implement what you're suggesting. Could you check if this
> >>>>>> scenario can happen:
> >>>>>>
> >>>>>> 1- Client C1 is the current leader and it issues a super boosted read
> >>>>>>    to make sure it is still the leader;
> >>>>>> 2- We process the super boosted read, having it go through the zab
> >>>>>>    pipeline;
> >>>>>> 3- When we send the response to C1 we slow down the whole deal: the
> >>>>>>    response to C1 gets delayed and we stall C1;
> >>>>>> 4- In the meanwhile, C1's session expires on the server side and its
> >>>>>>    ephemeral leadership node is removed;
> >>>>>> 5- A new client C2 is elected and starts exercising leadership;
> >>>>>> 6- Now C1 comes back to normal and receives the response of the super
> >>>>>>    boosted read saying that it is still the leader.
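The smart-batching scheme John describes above amortizes quorum round trips: every waiting caller needs one ENSURE round that started after its request arrived, but all callers queued before a round starts can share that round. A minimal single-threaded sketch, with the quorum round trip stubbed out (`SmartBatcher` and its method names are hypothetical, not from the thread):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Single-threaded sketch of smart batching for ensureQuorum requests.
// Each runEnsureRound() models one ENSURE round trip to a quorum of
// followers; it answers every request that was queued before it started,
// no matter how many, with a single round trip.
class SmartBatcher {
    private final Queue<Runnable> pending = new ArrayDeque<>();
    private int roundTrips = 0;

    /** An ensureQuorum request arrives; onEnsured runs once it is safe. */
    void submit(Runnable onEnsured) {
        pending.add(onEnsured);
    }

    /** One ENSURE round trip services all requests queued before it began. */
    void runEnsureRound() {
        int n = pending.size(); // only requests that arrived before this round
        roundTrips++;           // a single quorum round trip, regardless of n
        for (int i = 0; i < n; i++) {
            pending.poll().run();
        }
    }

    int roundTrips() {
        return roundTrips;
    }
}
```

In this model, ten requests that arrive while a round is in flight are all answered by the next single round, matching steps 3-5 of the list above.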
> >>>>>>
> >>>>>> If my interpretation is correct, the only way to prevent this
> >>>>>> scenario from happening is if the session expires on the client side
> >>>>>> before it receives the response of the read. It doesn't look like we
> >>>>>> can do that if process clocks can be arbitrarily delayed.
> >>>>>>
> >>>>>> Note that one issue is that the behavior of ephemerals is highly
> >>>>>> dependent upon timers, so I don't think we can avoid making some
> >>>>>> timing assumptions altogether. The question is whether we are better
> >>>>>> off with a mechanism relying upon acknowledgements. My sense is that
> >>>>>> application-level fencing is preferable (if not necessary) for
> >>>>>> applications like the ones JC is mentioning, or BookKeeper.
> >>>>>>
> >>>>>> I'm not concerned about writes to disk, which I agree we don't need
> >>>>>> for sync. I'm more concerned about having it go through the whole
> >>>>>> pipeline, which will induce more traffic to zab and increase latency
> >>>>>> for an application that uses it heavily.
> >>>>>>
> >>>>>> -Flavio
> >>>>>>
> >>>>>> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
> >>>>>>
> >>>>>>> Another idea is to add this functionality to MultiOp - have
> >>>>>>> read-only transactions be replicated but not logged, or logged
> >>>>>>> asynchronously. I'm not sure how it works right now if I do a
> >>>>>>> read-only MultiOp transaction - does it replicate the transaction
> >>>>>>> or answer it locally on the leader?
> >>>>>>>
> >>>>>>> Alex
> >>>>>>>
> >>>>>>> On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <[email protected]> wrote:
> >>>>>>>> Thanks for the explanation.
> >>>>>>>>
> >>>>>>>> I guess one could always invoke a write operation instead of sync
> >>>>>>>> to get the stricter semantics, but as John suggests, it might be a
> >>>>>>>> good idea to add a new type of operation that requires followers to
> >>>>>>>> ack but doesn't require them to log to disk - this seems sufficient
> >>>>>>>> in our case.
> >>>>>>>>
> >>>>>>>> Alex
> >>>>>>>>
> >>>>>>>> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <[email protected]> wrote:
> >>>>>>>>> In theory, the scenario you're describing could happen, but I
> >>>>>>>>> would argue that it is unlikely given that: 1) a leader pings
> >>>>>>>>> followers twice a tick to make sure that it has a quorum of
> >>>>>>>>> supporters (lead()); 2) followers give up on a leader upon catching
> >>>>>>>>> an exception (followLeader()). One could calibrate tickTime to make
> >>>>>>>>> the probability of having this scenario low.
> >>>>>>>>>
> >>>>>>>>> Let me also revisit the motivation for the way we designed sync.
> >>>>>>>>> ZooKeeper has been designed to serve reads efficiently, and making
> >>>>>>>>> sync go through the pipeline would slow down reads. Although
> >>>>>>>>> optional, we thought it would be a good idea to make it as
> >>>>>>>>> efficient as possible to comply with the original expectations for
> >>>>>>>>> the service. We consequently came up with this cheap way of making
> >>>>>>>>> sure that a read sees all pending updates. It is correct that there
> >>>>>>>>> are some corner cases that it doesn't cover. One is the case you
> >>>>>>>>> mentioned. Another is having the sync finish before the client
> >>>>>>>>> submits the read and having a write commit in between. We rely upon
> >>>>>>>>> the way we implement timeouts and some minimum degree of synchrony
> >>>>>>>>> for the clients when submitting operations to guarantee that the
> >>>>>>>>> scheme works.
> >>>>>>>>>
> >>>>>>>>> We thought about the option of having the sync operation go
> >>>>>>>>> through the pipeline, and in fact it would have been easier to
> >>>>>>>>> implement it just as a regular write, but we opted not to because
> >>>>>>>>> we felt it was sufficient for the use cases we had and more
> >>>>>>>>> efficient, as I already argued.
> >>>>>>>>>
> >>>>>>>>> Hope this helps to clarify.
> >>>>>>>>>
> >>>>>>>>> -Flavio
> >>>>>>>>>
> >>>>>>>>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
> >>>>>>>>>
> >>>>>>>>>> Thanks for the explanation! But how do you avoid the scenario
> >>>>>>>>>> raised by John? Let's say you're a client connected to F, and F
> >>>>>>>>>> is connected to L. Let's also say that L's pipeline is now empty,
> >>>>>>>>>> and both F and L are partitioned from 3 other servers in the
> >>>>>>>>>> system that have already elected a new leader L'. Now I go to L'
> >>>>>>>>>> and write something. L still thinks it's the leader because the
> >>>>>>>>>> detection that followers left it is obviously timeout-dependent.
> >>>>>>>>>> So when F sends your sync to L and L returns it to F, you
> >>>>>>>>>> actually miss my write!
> >>>>>>>>>>
> >>>>>>>>>> Alex
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <[email protected]> wrote:
> >>>>>>>>>>> Hi Alex, because of the following:
> >>>>>>>>>>>
> >>>>>>>>>>> 1- A follower F processes operations from a client in FIFO
> >>>>>>>>>>>    order, and say that a client submits, as you say, sync + read;
> >>>>>>>>>>> 2- A sync will be processed by the leader and returned to the
> >>>>>>>>>>>    follower. It will be queued after all pending updates that the
> >>>>>>>>>>>    follower hasn't processed;
> >>>>>>>>>>> 3- The follower will process all pending updates before
> >>>>>>>>>>>    processing the response of the sync;
> >>>>>>>>>>> 4- Once the follower processes the sync, it picks the read
> >>>>>>>>>>>    operation to process.
> >>>>>>>>>>>    It reads the local state of the follower and returns to the
> >>>>>>>>>>>    client.
> >>>>>>>>>>>
> >>>>>>>>>>> When we process the read in Step 4, we have applied all pending
> >>>>>>>>>>> updates the leader had for the follower by the time the read
> >>>>>>>>>>> request started.
> >>>>>>>>>>>
> >>>>>>>>>>> This implementation is a bit of a hack because it doesn't follow
> >>>>>>>>>>> the same code path as the other operations that go to the
> >>>>>>>>>>> leader, but it avoids some unnecessary steps, which is important
> >>>>>>>>>>> for fast reads. In the sync case, the other followers don't
> >>>>>>>>>>> really need to know about it (there is nothing to be updated)
> >>>>>>>>>>> and the leader simply inserts it in the sequence of updates of
> >>>>>>>>>>> F, ordering it.
> >>>>>>>>>>>
> >>>>>>>>>>> -Flavio
> >>>>>>>>>>>
> >>>>>>>>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Flavio,
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Starting a read operation concurrently with a sync implies
> >>>>>>>>>>>>> that the result of the read will not miss an update committed
> >>>>>>>>>>>>> before the read started.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I thought that the intention of sync was to give something like
> >>>>>>>>>>>> linearizable reads, so if you invoke a sync and then a read,
> >>>>>>>>>>>> your read is guaranteed to (at least) see any write which
> >>>>>>>>>>>> completed before the sync began. Is this the intention? If so,
> >>>>>>>>>>>> how is this achieved without running agreement on the sync op?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Alex
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <[email protected]> wrote:
> >>>>>>>>>>>>> sync simply flushes the channel between the leader and the
> >>>>>>>>>>>>> follower that forwarded the sync operation, so it doesn't go
> >>>>>>>>>>>>> through the full zab pipeline.
> >>>>>>>>>>>>> Flushing means that all pending updates from the leader to
> >>>>>>>>>>>>> the follower are received by the time sync completes. Starting
> >>>>>>>>>>>>> a read operation concurrently with a sync implies that the
> >>>>>>>>>>>>> result of the read will not miss an update committed before
> >>>>>>>>>>>>> the read started.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> It's strange that sync doesn't run through agreement; I was
> >>>>>>>>>>>>>> always assuming that it does... Exactly for the reason you
> >>>>>>>>>>>>>> say - you may trust your leader, but I may have a different
> >>>>>>>>>>>>>> leader, and your leader may not have detected it yet and
> >>>>>>>>>>>>>> still think it's the leader.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This seems like a bug to me.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or
> >>>>>>>>>>>>>> shouldn't) depend on timing assumptions.
> >>>>>>>>>>>>>> Only progress guarantees depend on time.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Alex
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <[email protected]> wrote:
> >>>>>>>>>>>>>>> I have some pretty strong requirements in terms of
> >>>>>>>>>>>>>>> consistency, where reading from followers that may be behind
> >>>>>>>>>>>>>>> in terms of updates isn't ok for my use case.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> One error case that worries me is if a follower and leader
> >>>>>>>>>>>>>>> are partitioned off from the network. A new leader is
> >>>>>>>>>>>>>>> elected, but the follower and old leader don't know about it.
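The sync-as-flush behavior Flavio describes earlier in the thread can be modeled with a toy follower: updates committed at the leader sit in a per-follower queue, and completing a sync means applying everything queued before it, so a read issued after the sync sees those updates. `FollowerModel` and its method names are purely illustrative assumptions; this is not ZooKeeper code.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model of sync as a channel flush. The leader queues a sync marker
// behind all updates pending for this follower, so by the time the
// follower completes the sync it has applied every update the leader had
// committed when the sync was issued.
class FollowerModel {
    private final Map<String, String> state = new HashMap<>();
    private final Queue<String[]> pendingFromLeader = new ArrayDeque<>();

    /** Leader side: a committed update sits in the follower's channel. */
    void leaderCommit(String path, String value) {
        pendingFromLeader.add(new String[] {path, value});
    }

    /** sync(): apply everything queued before the sync marker. */
    void sync() {
        while (!pendingFromLeader.isEmpty()) {
            String[] update = pendingFromLeader.poll();
            state.put(update[0], update[1]);
        }
    }

    /** Local read, served from the follower's state without consensus. */
    String read(String path) {
        return state.get(path);
    }
}
```

The model also makes John's complaint concrete: the flush only orders the read after updates this leader knows about; it says nothing about writes accepted by a different leader L' on the other side of a partition.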
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Normally I would think sync was made for this purpose, but
> >>>>>>>>>>>>>>> I looked at the sync code, and if there aren't any
> >>>>>>>>>>>>>>> outstanding proposals the leader sends the sync right back
> >>>>>>>>>>>>>>> to the client without first verifying that it still has
> >>>>>>>>>>>>>>> quorum, so this won't work for my use case.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> At the core of the issue, all I really need is a call that
> >>>>>>>>>>>>>>> will make its way to the leader and will ping its followers,
> >>>>>>>>>>>>>>> ensure it still has a quorum, and return success.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be
> >>>>>>>>>>>>>>> forwarded to the leader; the leader will ensure it still has
> >>>>>>>>>>>>>>> quorum and return its epoch. I can use this primitive to
> >>>>>>>>>>>>>>> implement all the other properties I want to verify
> >>>>>>>>>>>>>>> (assuming that my client will never connect to an older
> >>>>>>>>>>>>>>> epoch after this call returns). Also, the nice thing about
> >>>>>>>>>>>>>>> this method is that it will not have to hit disk, and the
> >>>>>>>>>>>>>>> latency should just be a round trip to the followers.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Most of the guarantees offered by ZooKeeper are time-based
> >>>>>>>>>>>>>>> and rely on clocks and expiring timers, but I'm hoping to
> >>>>>>>>>>>>>>> offer some guarantees in spite of busted clocks, horrible GC
> >>>>>>>>>>>>>>> perf, VM suspends, and any other way time is broken.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Also, if people are interested I can go into more detail
> >>>>>>>>>>>>>>> about what I am trying to write.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -jc
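The getCurrentLeaderEpoch() primitive John asks for above could look roughly like the sketch below: the leader checks that a majority of the ensemble (counting itself) still acknowledges it before returning its epoch, without touching disk. `EpochChecker` and the ack predicate are hypothetical stand-ins for the real follower channels, not an actual ZooKeeper API.

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of the proposed getCurrentLeaderEpoch() primitive: ping the
// followers, count acknowledgements, and return the leader's epoch only
// if a majority of the ensemble (leader included) still acks it. No disk
// write is involved; the cost is one round trip to the followers.
class EpochChecker {
    private final long epoch;
    private final List<String> followers;
    private final Predicate<String> acksLeader; // does this follower still ack us?

    EpochChecker(long epoch, List<String> followers, Predicate<String> acksLeader) {
        this.epoch = epoch;
        this.followers = followers;
        this.acksLeader = acksLeader;
    }

    /** Return our epoch iff a majority of the ensemble still acks us. */
    long getCurrentLeaderEpoch() {
        int acks = 1; // the leader counts itself
        for (String follower : followers) {
            if (acksLeader.test(follower)) {
                acks++;
            }
        }
        int ensembleSize = followers.size() + 1;
        if (acks * 2 <= ensembleSize) {
            throw new IllegalStateException("lost quorum");
        }
        return epoch;
    }
}
```

As the thread notes, this check is a happens-after guarantee, not a lease: a partition can still form immediately after the call returns, which is why John pairs it with the assumption that his clients never connect to an older epoch afterwards.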
