Seems like you can have a simpler mechanism.

With ephemeral nodes (I don't think that sequential ephemeral is strictly
needed) nobody is going to become leader while that ephemeral node exists.
Thus, if the master has not received a disconnect notification, it will be
at least several tens of seconds before somebody else can become leader.
If you know that the master won't skip around in time, then you should be
safe.  This guarantee might be violated if the master is on a VM that is
paused at the critical moment, but I think your other implementation has
that same problem.
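
A conservative version of that window can be sketched as pure logic (illustrative only: the 30-second timeout and the function name are made up, not ZooKeeper API; a real master would track when it last heard from its ZooKeeper session):

```python
def may_act_as_leader(seconds_since_last_heartbeat, session_timeout=30.0):
    """While the master's ZooKeeper session is alive, its ephemeral node
    exists and nobody else can become leader.  Conservatively, the master
    only acts as leader while less than the session timeout has elapsed
    since it last heard from ZooKeeper.  This assumes local time doesn't
    skip around (the VM-pause caveat above).  30 s is illustrative."""
    return seconds_since_last_heartbeat < session_timeout

# Well inside the window: safe to act.  Past the window: stand down.
print(may_act_as_leader(5.0), may_act_as_leader(45.0))
```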

On Thu, Sep 27, 2012 at 2:41 AM, John Carrino <[email protected]> wrote:

> Basically I am doing leader election using ZK with sequential ephemeral
> nodes.  I want a guaranteed way to ensure that my ephemeral node still
> exists (no other leader has done work).  Let's call my elected leader L1.
> L1 may serve a request if it is the leader at the time the request was
> made.  L1 may lose leadership during the request or after it has
> responded.  I only need a happens-after relationship between the server
> getting the request and checking getCurrentLeaderEpoch() and doing a read
> to ensure L1 still has the lowest seq/ephemeral node (no disk writes
> needed).
>
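The happens-after check described above can be sketched as pure logic (illustrative only: `holds_lowest`, `serve_fenced`, and the `n-` node prefix are hypothetical names; a real implementation would read the election children through a ZooKeeper client):

```python
def sequence_number(znode):
    """ZooKeeper appends a monotonically increasing sequence to sequential
    nodes, e.g. 'n-0000000007' -> 7."""
    return int(znode.rsplit('-', 1)[1])

def holds_lowest(my_znode, children):
    """L1 is the leader iff its ephemeral node has the lowest sequence."""
    return sequence_number(my_znode) == min(map(sequence_number, children))

def serve_fenced(my_znode, read_children, handle):
    """Do the work first, then re-verify leadership *after*: the lock need
    not be held for the whole request, only re-confirmed at some point
    after the request was initiated."""
    result = handle()
    if not holds_lowest(my_znode, read_children()):
        raise RuntimeError("L1 lost leadership; discard result")
    return result

# 'n-0000000003' is still the lowest sequence, so the fenced call succeeds.
print(serve_fenced('n-0000000003',
                   lambda: ['n-0000000003', 'n-0000000010'],
                   lambda: 'ok'))
```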
> Normally when people think of locking they assume that the lock must be
> held throughout the entire request.  In this way distributed locking is
> "hard" or it might even be impossible (I haven't really looked into it
> formally).  I don't need the lock for the duration of the request to ensure
> correctness in my system.  I only require that the leader still held its
> lock some time "after" the request was initiated.
>
> I most likely could use ZK as is and won't hit any bugs from this, but I am
> kinda OCD when it comes to building this type of infrastructure.
>
> I think of this more as a feature request than a bug because from reading
> up it seems like ZK gives you Reliable, Total order and Causal message
> delivery.  If I were to do a write request, then a sync, then check my node
> still exists would have the property I desire.  However I don't want to
> take the perf hit of doing a write.
>
> Thanks!
>
> -jc
>
>
> On Wed, Sep 26, 2012 at 6:43 PM, Alexander Shraer <[email protected]>
> wrote:
>
> > It's strange that sync doesn't run through agreement; I was always
> > assuming that it did... Exactly for the reason you say -
> > you may trust your leader, but I may have a different leader and your
> > leader may not detect it yet and still think it's the leader.
> >
> > This seems like a bug to me.
> >
> > Similarly to Paxos, Zookeeper's safety guarantees don't (or shouldn't)
> > depend on timing assumptions.
> > Only progress guarantees depend on time.
> >
> > Alex
> >
> >
> > On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <[email protected]>
> > wrote:
> > > I have some pretty strong requirements in terms of consistency where
> > > reading from followers that may be behind in terms of updates isn't ok
> > > for my use case.
> > >
> > > One error case that worries me is if a follower and leader are
> > > partitioned off from the network.  A new leader is elected, but the
> > > follower and old leader don't know about it.
> > >
> > > Normally I think sync was made for this purpose, but I looked at the
> > > sync code and if there aren't any outstanding proposals the leader
> > > sends the sync right back to the client without first verifying that
> > > it still has quorum, so this won't work for my use case.
> > >
> > > At the core of the issue, all I really need is a call that will make
> > > its way to the leader and will ping its followers, ensure it still
> > > has a quorum and return success.
> > >
> > > Basically a getCurrentLeaderEpoch() method that will be forwarded to
> > > the leader; the leader will ensure it still has quorum and return its
> > > epoch.  I can use this primitive to implement all the other properties
> > > I want to verify (assuming that my client will never connect to an
> > > older epoch after this call returns).  Also the nice thing about this
> > > method is that it will not have to hit disk, and the latency should
> > > just be a round trip to the followers.
> > >
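The proposed primitive's decision rule can be sketched as follows (purely hypothetical: getCurrentLeaderEpoch() does not exist in ZooKeeper's API, and follower acknowledgements are simulated here with booleans instead of real pings):

```python
def get_current_leader_epoch(epoch, follower_acks, ensemble_size):
    """Return the leader's epoch iff a quorum of the ensemble (a strict
    majority, counting the leader itself) still acknowledges it;
    otherwise raise so the caller knows leadership is not confirmed."""
    votes = 1 + sum(1 for ack in follower_acks if ack)  # leader votes for itself
    if votes * 2 > ensemble_size:
        return epoch
    raise RuntimeError("leader lost quorum")

# 5-node ensemble: leader + 2 acking followers = 3 votes, a majority.
print(get_current_leader_epoch(7, [True, True, False, False], 5))
```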
> > > Most of the guarantees offered by zookeeper are time-based and rely on
> > > clocks and expiring timers, but I'm hoping to offer some guarantees in
> > > spite of busted clocks, horrible GC perf, VM suspends and any other way
> > > time is broken.
> > >
> > > Also if people are interested I can go into more detail about what I am
> > > trying to write.
> > >
> > > -jc
> >
>
