Yes. This property needs to be global because otherwise another transaction can start and commit in the past when it shouldn't be allowed to.
-sent from my phone
On Sep 28, 2012 8:13 AM, "Flavio Junqueira" <[email protected]> wrote:

> I checked your message again, and the contract is that getFreshTimestamp()
> always gets a fresh timestamp (not older than any returned by that method).
> One aspect of the contract that is not clear to me: does this freshness
> property hold across clients or only for a single client? If it holds only
> per client, then you can guarantee it with master epochs. No user of your
> application will connect to the master of epoch y < x once it connects to
> the master of epoch x. The epochs can be the sequence number of a
> sequential znode. How does that sound?
>
> -Flavio
>
> On Sep 28, 2012, at 5:00 PM, Flavio Junqueira wrote:
>
> > I was thinking that you can do a write per timestamp batch, and not per
> > individual timestamp. In the worst case, a former leader won't use some
> > timestamps, and I would think that is ok, but it depends on your
> > application.
> >
> > Also, even if two clients believe they are leaders simultaneously and
> > serve timestamps, the property that you seem to care about the most is
> > uniqueness: a timestamp is not served twice. Order of timestamps would be
> > preserved per leader, but in the case of overlapping leaders you could end
> > up serving timestamps that do not follow an increasing order.
> >
> > -Flavio
> >
> > On Sep 28, 2012, at 4:37 PM, John Carrino wrote:
> >
> >> CompareAndSwap is an atomic check and update: it only updates the
> >> value if it is the same as the expected value.
> >>
> >> I think with your approach you'd have to do a write for every single
> >> timestamp you wanted to hand out. The latency hit on this would be too
> >> much.
> >>
> >> My approach is different in that a timestamp server reserves a bunch of
> >> timestamps up front and proceeds to hand them out as long as it is the
> >> leader. The leader check can hopefully be done without hitting disk.
> >>
> >> Thanks!
> >>
> >> -jc
> >>
> >>
> >> On Fri, Sep 28, 2012 at 7:19 AM, Flavio Junqueira <[email protected]> wrote:
> >>
> >>> I don't know what your compareAndSwap method does, but I was wondering if
> >>> your client process can use conditional writes to a znode to make sure
> >>> that it was the last one to update the state of timestamp batches. You
> >>> can treat the master election problem separately, and it does not have to
> >>> be as strict as you have been thinking you need. That is, it wouldn't
> >>> hurt if a client still thinks it is leading even if it is not, because no
> >>> two clients will be able to update the state of timestamp blocks without
> >>> noticing that another client is also updating it.
> >>>
> >>> -Flavio
> >>>
> >>> On Sep 27, 2012, at 6:57 PM, John Carrino wrote:
> >>>
> >>>> So I think it's time to explain what I'm writing just so everyone has
> >>>> more situational awareness. It's just a timestamp server, nothing fancy.
> >>>>
> >>>> It looks like this:
> >>>>
> >>>> public interface TimestampService {
> >>>>     /**
> >>>>      * This will get a fresh timestamp that is guaranteed to be newer
> >>>>      * than any other timestamp handed out before this method was called.
> >>>>      */
> >>>>     long getFreshTimestamp();
> >>>> }
> >>>>
> >>>> The only requirement is that the timestamp handed back is greater than
> >>>> every other timestamp that was returned before getFreshTimestamp was
> >>>> called. There is no ordering requirement for concurrent requests.
> >>>>
> >>>> My impl is to reserve blocks of timestamps that are safe to hand out (1M
> >>>> at a time) using compare and swap in ZK:
> >>>>
> >>>> lastPossibleUsed = read(HighWater)
> >>>> safeToHandout = compareAndSwap(lastPossibleUsed, lastPossibleUsed + 1M)
> >>>>
> >>>> Now my leader can hand back timestamps up to safeToHandout, but before
> >>>> it hands one out it must ensure it is still the leader (no one else has
> >>>> handed back something higher).
> >>>> I can use ensureQuorum() and exists(myEphemNode) to make sure this is
> >>>> the case. Now I have a service that is guaranteed to be correct but
> >>>> doesn't require disk hits in the steady state, which brings down my
> >>>> latency (if you get close to running out, you can compareAndSwap for
> >>>> more timestamps).
> >>>>
> >>>> If many requests come in at the same time I can use smart batching to
> >>>> verify happens-after for all of them at once. We can also add more
> >>>> layers if we need more bandwidth to scale up, at the cost of adding
> >>>> latency. Basically our latency will be O(lg(requestRate)) if we keep
> >>>> adding layers as each previous layer becomes saturated.
> >>>>
> >>>> I hope this explanation helps. I am busy for the next 4 hours, but if
> >>>> you need more clarification I can respond then.
> >>>>
> >>>> -jc
> >>>>
> >>>>
> >>>> On Thu, Sep 27, 2012 at 9:26 AM, John Carrino <[email protected]> wrote:
> >>>>
> >>>>> First, thanks everyone for talking this through with me.
> >>>>>
> >>>>> Flavio, for your example, this is actually ok. There is a
> >>>>> happens-after relationship between the client making the request and my
> >>>>> leader C1 still being the leader. My service only needs to guarantee
> >>>>> that what it hands back is at least as new as anything that existed
> >>>>> when the client made the request. If C2 were to answer requests while
> >>>>> C1 is stalling, that is ok, because these would be considered
> >>>>> concurrent requests, and the stuff returned by C2 may be newer, but
> >>>>> that doesn't violate any guarantees.
> >>>>>
> >>>>> If some client were to get back something from C2 and then (in a
> >>>>> happens-after relationship) someone tried to read from C1, it needs to
> >>>>> fail.
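The block-reservation scheme John describes above (read the high-water mark, compare-and-swap it forward by 1M, then hand out timestamps from the reserved range) can be sketched in plain Java. This is a minimal illustration, not code from the thread: the HighWater znode and its conditional write are simulated with an `AtomicLong`, the names `TimestampAllocator` and `BLOCK_SIZE` are made up, and the per-timestamp leadership check (`ensureQuorum()` / `exists(myEphemNode)`) is deliberately omitted.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the block-reservation scheme from the thread. The ZooKeeper
// znode holding the high-water mark is simulated by an AtomicLong; in the
// real service the compareAndSwap would be a conditional setData on the
// znode using its version. The leadership check before each handout is
// omitted here.
class TimestampAllocator {
    static final long BLOCK_SIZE = 1_000_000L;

    private final AtomicLong highWater; // stands in for the HighWater znode
    private long next = 1;              // next timestamp to hand out
    private long safeToHandOut = 0;     // upper bound of the reserved block

    TimestampAllocator(AtomicLong sharedHighWater) {
        this.highWater = sharedHighWater;
    }

    /** Reserve a fresh block of BLOCK_SIZE timestamps via compare-and-swap. */
    private void reserveBlock() {
        while (true) {
            long lastPossibleUsed = highWater.get();     // read(HighWater)
            long newLimit = lastPossibleUsed + BLOCK_SIZE;
            // compareAndSwap(lastPossibleUsed, lastPossibleUsed + 1M)
            if (highWater.compareAndSet(lastPossibleUsed, newLimit)) {
                next = lastPossibleUsed + 1;
                safeToHandOut = newLimit;
                return;
            }
            // Another allocator won the race; retry against the new value.
        }
    }

    /** Hand out a timestamp; no ZK/disk round trip while the block lasts. */
    long getFreshTimestamp() {
        if (next > safeToHandOut) {
            reserveBlock();
        }
        return next++;
    }
}
```

Two allocators sharing the same high-water mark reserve disjoint blocks, which gives the uniqueness property Flavio points out, while ordering across overlapping leaders is not guaranteed.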
> >>>>>
> >>>>> To address your concern of adding too much bandwidth, we can get this
> >>>>> easily by doing what Martin Thompson calls smart batching (
> >>>>> http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html ).
> >>>>>
> >>>>> 1. An ensureQuorum request comes in to L1.
> >>>>> 2. Send ENSURE to all followers.
> >>>>> 3. 10 more ensureQuorum requests come in.
> >>>>> 4. Get back ENSURE from a quorum.
> >>>>> 5. We can now service all 10 pending ensureQuorum requests with
> >>>>>    another round-trip ENSURE.
> >>>>>
> >>>>> We don't need to send an ENSURE for every ensureQuorum request; we
> >>>>> just need it to be happens-after from when the request arrived.
> >>>>>
> >>>>> I am fine with the ephemeral node being removed after some time
> >>>>> expires, but only by the leader. If the leader's clock is broken and
> >>>>> the client owning the ephemeral node drops off, then we don't have
> >>>>> liveness (because this node may not get cleaned up in a timely
> >>>>> fashion). However, we still preserve correctness.
> >>>>>
> >>>>> -jc
> >>>>>
> >>>>>
> >>>>> On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <[email protected]> wrote:
> >>>>>
> >>>>>> Say that we implement what you're suggesting. Could you check if this
> >>>>>> scenario can happen:
> >>>>>>
> >>>>>> 1- Client C1 is the current leader and it issues a super boosted read
> >>>>>>    to make sure it is still the leader;
> >>>>>> 2- We process the super boosted read, having it go through the zab
> >>>>>>    pipeline;
> >>>>>> 3- When we send the response to C1 we slow down the whole deal: the
> >>>>>>    response to C1 gets delayed and we stall C1;
> >>>>>> 4- In the meanwhile, C1's session expires on the server side and its
> >>>>>>    ephemeral leadership node is removed;
> >>>>>> 5- A new client C2 is elected and starts exercising leadership;
> >>>>>> 6- Now C1 comes back to normal and receives the response of the super
> >>>>>>    boosted read saying that it is still the leader.
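The smart-batching scheme John describes above amortizes quorum round trips: every waiting caller needs one ENSURE round that started after its request arrived, but all callers queued before a round starts can share that round. A minimal single-threaded sketch, with the quorum round trip stubbed out (`SmartBatcher` and its method names are hypothetical, not from the thread):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Single-threaded sketch of smart batching for ensureQuorum requests.
// Each runEnsureRound() models one ENSURE round trip to a quorum of
// followers; it answers every request that was queued before it started,
// no matter how many, with a single round trip.
class SmartBatcher {
    private final Queue<Runnable> pending = new ArrayDeque<>();
    private int roundTrips = 0;

    /** An ensureQuorum request arrives; onEnsured runs once it is safe. */
    void submit(Runnable onEnsured) {
        pending.add(onEnsured);
    }

    /** One ENSURE round trip services all requests queued before it began. */
    void runEnsureRound() {
        int n = pending.size(); // only requests that arrived before this round
        roundTrips++;           // a single quorum round trip, regardless of n
        for (int i = 0; i < n; i++) {
            pending.poll().run();
        }
    }

    int roundTrips() {
        return roundTrips;
    }
}
```

In this model, ten requests that arrive while a round is in flight are all answered by the next single round, matching steps 3-5 of the list above.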
> >>>>>>
> >>>>>> If my interpretation is correct, the only way to prevent this
> >>>>>> scenario from happening is if the session expires on the client side
> >>>>>> before it receives the response of the read. It doesn't look like we
> >>>>>> can do that if process clocks can be arbitrarily delayed.
> >>>>>>
> >>>>>> Note that one issue is that the behavior of ephemerals is highly
> >>>>>> dependent upon timers, so I don't think we can avoid making some
> >>>>>> timing assumptions altogether. The question is whether we are better
> >>>>>> off with a mechanism relying upon acknowledgements. My sense is that
> >>>>>> application-level fencing is preferable (if not necessary) for
> >>>>>> applications like the ones JC is mentioning, or BookKeeper.
> >>>>>>
> >>>>>> I'm not concerned about writes to disk, which I agree we don't need
> >>>>>> for sync. I'm more concerned about having it go through the whole
> >>>>>> pipeline, which will induce more traffic to zab and increase latency
> >>>>>> for an application that uses it heavily.
> >>>>>>
> >>>>>> -Flavio
> >>>>>>
> >>>>>> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
> >>>>>>
> >>>>>>> Another idea is to add this functionality to MultiOp - have
> >>>>>>> read-only transactions be replicated but not logged, or logged
> >>>>>>> asynchronously. I'm not sure how it works right now if I do a
> >>>>>>> read-only MultiOp transaction - does it replicate the transaction
> >>>>>>> or answer it locally on the leader?
> >>>>>>>
> >>>>>>> Alex
> >>>>>>>
> >>>>>>> On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <[email protected]> wrote:
> >>>>>>>> Thanks for the explanation.
> >>>>>>>>
> >>>>>>>> I guess one could always invoke a write operation instead of sync
> >>>>>>>> to get the stricter semantics, but as John suggests, it might be a
> >>>>>>>> good idea to add a new type of operation that requires followers to
> >>>>>>>> ack but doesn't require them to log to disk - this seems sufficient
> >>>>>>>> in our case.
> >>>>>>>>
> >>>>>>>> Alex
> >>>>>>>>
> >>>>>>>> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <[email protected]> wrote:
> >>>>>>>>> In theory, the scenario you're describing could happen, but I
> >>>>>>>>> would argue that it is unlikely given that: 1) a leader pings
> >>>>>>>>> followers twice a tick to make sure that it has a quorum of
> >>>>>>>>> supporters (lead()); 2) followers give up on a leader upon catching
> >>>>>>>>> an exception (followLeader()). One could calibrate tickTime to make
> >>>>>>>>> the probability of having this scenario low.
> >>>>>>>>>
> >>>>>>>>> Let me also revisit the motivation for the way we designed sync.
> >>>>>>>>> ZooKeeper has been designed to serve reads efficiently, and making
> >>>>>>>>> sync go through the pipeline would slow down reads. Although
> >>>>>>>>> optional, we thought it would be a good idea to make it as
> >>>>>>>>> efficient as possible to comply with the original expectations for
> >>>>>>>>> the service. We consequently came up with this cheap way of making
> >>>>>>>>> sure that a read sees all pending updates. It is correct that there
> >>>>>>>>> are some corner cases that it doesn't cover. One is the case you
> >>>>>>>>> mentioned. Another is having the sync finish before the client
> >>>>>>>>> submits the read and having a write commit in between. We rely upon
> >>>>>>>>> the way we implement timeouts and some minimum degree of synchrony
> >>>>>>>>> for the clients when submitting operations to guarantee that the
> >>>>>>>>> scheme works.
> >>>>>>>>>
> >>>>>>>>> We thought about the option of having the sync operation go
> >>>>>>>>> through the pipeline, and in fact it would have been easier to
> >>>>>>>>> implement it just as a regular write, but we opted not to because
> >>>>>>>>> we felt it was sufficient for the use cases we had and more
> >>>>>>>>> efficient, as I already argued.
> >>>>>>>>>
> >>>>>>>>> Hope this helps to clarify.
> >>>>>>>>>
> >>>>>>>>> -Flavio
> >>>>>>>>>
> >>>>>>>>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
> >>>>>>>>>
> >>>>>>>>>> Thanks for the explanation! But how do you avoid the scenario
> >>>>>>>>>> raised by John? Let's say you're a client connected to F, and F
> >>>>>>>>>> is connected to L. Let's also say that L's pipeline is now empty,
> >>>>>>>>>> and both F and L are partitioned from 3 other servers in the
> >>>>>>>>>> system that have already elected a new leader L'. Now I go to L'
> >>>>>>>>>> and write something. L still thinks it's the leader because the
> >>>>>>>>>> detection that followers left it is obviously timeout-dependent.
> >>>>>>>>>> So when F sends your sync to L and L returns it to F, you
> >>>>>>>>>> actually miss my write!
> >>>>>>>>>>
> >>>>>>>>>> Alex
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <[email protected]> wrote:
> >>>>>>>>>>> Hi Alex, because of the following:
> >>>>>>>>>>>
> >>>>>>>>>>> 1- A follower F processes operations from a client in FIFO
> >>>>>>>>>>>    order, and say that a client submits, as you say, sync + read;
> >>>>>>>>>>> 2- A sync will be processed by the leader and returned to the
> >>>>>>>>>>>    follower. It will be queued after all pending updates that the
> >>>>>>>>>>>    follower hasn't processed;
> >>>>>>>>>>> 3- The follower will process all pending updates before
> >>>>>>>>>>>    processing the response of the sync;
> >>>>>>>>>>> 4- Once the follower processes the sync, it picks the read
> >>>>>>>>>>>    operation to process.
> >>>>>>>>>>>    It reads the local state of the follower and returns to the
> >>>>>>>>>>>    client.
> >>>>>>>>>>>
> >>>>>>>>>>> When we process the read in Step 4, we have applied all pending
> >>>>>>>>>>> updates the leader had for the follower by the time the read
> >>>>>>>>>>> request started.
> >>>>>>>>>>>
> >>>>>>>>>>> This implementation is a bit of a hack because it doesn't follow
> >>>>>>>>>>> the same code path as the other operations that go to the
> >>>>>>>>>>> leader, but it avoids some unnecessary steps, which is important
> >>>>>>>>>>> for fast reads. In the sync case, the other followers don't
> >>>>>>>>>>> really need to know about it (there is nothing to be updated)
> >>>>>>>>>>> and the leader simply inserts it in the sequence of updates of
> >>>>>>>>>>> F, ordering it.
> >>>>>>>>>>>
> >>>>>>>>>>> -Flavio
> >>>>>>>>>>>
> >>>>>>>>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Flavio,
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Starting a read operation concurrently with a sync implies
> >>>>>>>>>>>>> that the result of the read will not miss an update committed
> >>>>>>>>>>>>> before the read started.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I thought that the intention of sync was to give something like
> >>>>>>>>>>>> linearizable reads, so if you invoke a sync and then a read,
> >>>>>>>>>>>> your read is guaranteed to (at least) see any write which
> >>>>>>>>>>>> completed before the sync began. Is this the intention? If so,
> >>>>>>>>>>>> how is this achieved without running agreement on the sync op?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Alex
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <[email protected]> wrote:
> >>>>>>>>>>>>> sync simply flushes the channel between the leader and the
> >>>>>>>>>>>>> follower that forwarded the sync operation, so it doesn't go
> >>>>>>>>>>>>> through the full zab pipeline.
> >>>>>>>>>>>>> Flushing means that all pending updates from the leader to
> >>>>>>>>>>>>> the follower are received by the time sync completes. Starting
> >>>>>>>>>>>>> a read operation concurrently with a sync implies that the
> >>>>>>>>>>>>> result of the read will not miss an update committed before
> >>>>>>>>>>>>> the read started.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> It's strange that sync doesn't run through agreement; I was
> >>>>>>>>>>>>>> always assuming that it does... Exactly for the reason you
> >>>>>>>>>>>>>> say - you may trust your leader, but I may have a different
> >>>>>>>>>>>>>> leader, and your leader may not have detected it yet and
> >>>>>>>>>>>>>> still think it's the leader.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This seems like a bug to me.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or
> >>>>>>>>>>>>>> shouldn't) depend on timing assumptions.
> >>>>>>>>>>>>>> Only progress guarantees depend on time.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Alex
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <[email protected]> wrote:
> >>>>>>>>>>>>>>> I have some pretty strong requirements in terms of
> >>>>>>>>>>>>>>> consistency, where reading from followers that may be behind
> >>>>>>>>>>>>>>> in terms of updates isn't ok for my use case.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> One error case that worries me is if a follower and leader
> >>>>>>>>>>>>>>> are partitioned off from the network. A new leader is
> >>>>>>>>>>>>>>> elected, but the follower and old leader don't know about it.
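The sync-as-flush behavior Flavio describes earlier in the thread can be modeled with a toy follower: updates committed at the leader sit in a per-follower queue, and completing a sync means applying everything queued before it, so a read issued after the sync sees those updates. `FollowerModel` and its method names are purely illustrative assumptions; this is not ZooKeeper code.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy model of sync as a channel flush. The leader queues a sync marker
// behind all updates pending for this follower, so by the time the
// follower completes the sync it has applied every update the leader had
// committed when the sync was issued.
class FollowerModel {
    private final Map<String, String> state = new HashMap<>();
    private final Queue<String[]> pendingFromLeader = new ArrayDeque<>();

    /** Leader side: a committed update sits in the follower's channel. */
    void leaderCommit(String path, String value) {
        pendingFromLeader.add(new String[] {path, value});
    }

    /** sync(): apply everything queued before the sync marker. */
    void sync() {
        while (!pendingFromLeader.isEmpty()) {
            String[] update = pendingFromLeader.poll();
            state.put(update[0], update[1]);
        }
    }

    /** Local read, served from the follower's state without consensus. */
    String read(String path) {
        return state.get(path);
    }
}
```

The model also makes John's complaint concrete: the flush only orders the read after updates this leader knows about; it says nothing about writes accepted by a different leader L' on the other side of a partition.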
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Normally I would think sync was made for this purpose, but
> >>>>>>>>>>>>>>> I looked at the sync code, and if there aren't any
> >>>>>>>>>>>>>>> outstanding proposals the leader sends the sync right back
> >>>>>>>>>>>>>>> to the client without first verifying that it still has
> >>>>>>>>>>>>>>> quorum, so this won't work for my use case.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> At the core of the issue, all I really need is a call that
> >>>>>>>>>>>>>>> will make its way to the leader and will ping its followers,
> >>>>>>>>>>>>>>> ensure it still has a quorum, and return success.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be
> >>>>>>>>>>>>>>> forwarded to the leader; the leader will ensure it still has
> >>>>>>>>>>>>>>> quorum and return its epoch. I can use this primitive to
> >>>>>>>>>>>>>>> implement all the other properties I want to verify
> >>>>>>>>>>>>>>> (assuming that my client will never connect to an older
> >>>>>>>>>>>>>>> epoch after this call returns). Also, the nice thing about
> >>>>>>>>>>>>>>> this method is that it will not have to hit disk, and the
> >>>>>>>>>>>>>>> latency should just be a round trip to the followers.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Most of the guarantees offered by ZooKeeper are time-based
> >>>>>>>>>>>>>>> and rely on clocks and expiring timers, but I'm hoping to
> >>>>>>>>>>>>>>> offer some guarantees in spite of busted clocks, horrible GC
> >>>>>>>>>>>>>>> perf, VM suspends, and any other way time is broken.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Also, if people are interested I can go into more detail
> >>>>>>>>>>>>>>> about what I am trying to write.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -jc
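The getCurrentLeaderEpoch() primitive John asks for above could look roughly like the sketch below: the leader checks that a majority of the ensemble (counting itself) still acknowledges it before returning its epoch, without touching disk. `EpochChecker` and the ack predicate are hypothetical stand-ins for the real follower channels, not an actual ZooKeeper API.

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of the proposed getCurrentLeaderEpoch() primitive: ping the
// followers, count acknowledgements, and return the leader's epoch only
// if a majority of the ensemble (leader included) still acks it. No disk
// write is involved; the cost is one round trip to the followers.
class EpochChecker {
    private final long epoch;
    private final List<String> followers;
    private final Predicate<String> acksLeader; // does this follower still ack us?

    EpochChecker(long epoch, List<String> followers, Predicate<String> acksLeader) {
        this.epoch = epoch;
        this.followers = followers;
        this.acksLeader = acksLeader;
    }

    /** Return our epoch iff a majority of the ensemble still acks us. */
    long getCurrentLeaderEpoch() {
        int acks = 1; // the leader counts itself
        for (String follower : followers) {
            if (acksLeader.test(follower)) {
                acks++;
            }
        }
        int ensembleSize = followers.size() + 1;
        if (acks * 2 <= ensembleSize) {
            throw new IllegalStateException("lost quorum");
        }
        return epoch;
    }
}
```

As the thread notes, this check is a happens-after guarantee, not a lease: a partition can still form immediately after the call returns, which is why John pairs it with the assumption that his clients never connect to an older epoch afterwards.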
