Re: Leader election and leader operation based on zookeeper

2019-10-01 Thread Jordan Zimmerman
Yes, I think this is a hole. As I've thought more about it I think the method 
you described using the lock node in the transaction is actually the best.

-JZ

> On Sep 29, 2019, at 11:41 PM, Zili Chen  wrote:
> 
> Hi Jordan,
> 
> Here is a possible edge case of coordination node way.
> 
> - When an instance becomes leader it:
>   - Gets the version of the coordination ZNode
>   - Sets the data for that ZNode (the contents don't matter) using the
>     retrieved version number
>   - If the set succeeds you can be assured you are currently leader
>     (otherwise release leadership and re-contend)
>   - Save the new version
> 
> Actually, it is NOT atomic for an instance to become leader and then get the 
> version of the coordination znode. So an edge case is:
> 
> 1. instance-1 becomes leader, trying to get the version of the coordination 
> znode.
> 2. instance-2 becomes leader, update the coordination znode.
> 3. instance-1 gets the newer version and re-update the coordination znode.
> 
> Generally speaking instance-1 has suffered a session expire here, but since 
> Curator retries on session expire the case above is possible. Although 
> instance-2 will be misled into believing it is not the leader and will give up 
> leadership so that the algorithm can proceed, and instance-1 will be 
> asynchronously notified that it is not the leader, instance-1 may already have 
> performed some operations before that notification arrives.
> 
> Curator should ensure, with some synchronization logic, that instance-1 will 
> not regard itself as the leader. Or just use a cached leader latch path for 
> the check, because the latch path at the moment it becomes leader is 
> synchronized to be the exact one. To be more clear, by leader latch path I 
> don't mean the volatile field, but the path cached at the moment it becomes 
> leader.
> 
> Best,
> tison.
> 
> 
> On Sun, Sep 22, 2019 at 2:43 AM Zili Chen <wander4...@gmail.com> wrote:
> >the Curator recipes delete and recreate their paths
> 
> However, as mentioned above, we do a one-shot election (it doesn't reuse the 
> Curator recipe), so the latch path we check is always the path from the epoch 
> in which the contender became leader. You can check out an implementation of 
> the design here[1]. Even if we want to enable re-contending we can set a guard 
> 
> (change state -> track latch path)
> 
> and check that the state is LEADING && the path exists (so we don't get misled 
> and check a wrong path).
> 
> Checking the version of a coordination znode sounds like another valid 
> solution. I'm glad to see it in a future Curator version, and if there is a 
> ticket I can help dig into it a bit :-)
> 
> Best,
> tison.
> 
> [1] 
> https://github.com/TisonKun/flink/blob/ad51edbfccd417be1b5a1f136e81b0b77401c43a/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionServiceNG.java
>  
> 
> 
> On Sun, Sep 22, 2019 at 2:31 AM Jordan Zimmerman  wrote:
> The issue is that the leader path doesn't stay constant. Every time there is 
> a network partition, etc. the Curator recipes delete and recreate their 
> paths. So, I'm concerned that client code trying to keep track of the leader 
> path would be error prone (it's one reason that they aren't public - it's 
> volatile internal state).
> 
> -Jordan
> 
>> On Sep 21, 2019, at 1:26 PM, Zili Chen  wrote:
>> 
>> Hi Jordan,
>> 
>> >I think using the leader path may not work
>> 
>> could you share a situation where this strategy does not work? In the design 
>> we do leader contending one-shot, and when performing a transaction we check 
>> the existence of the latch path && that we are in state LEADING.
>> 
>> Given that the election algorithm works, the state transitions to LEADING only 
>> once its latch path has become the smallest sequential znode. So the existence 
>> of the latch path guards against anybody else having become leader.
>> 
>> 
>> On Sun, Sep 22, 2019 at 12:58 AM Jordan Zimmerman  wrote:
>> Yeah, Ted - I think this is basically the same thing. We should all try to 
>> poke holes in this.
>> 
>> -JZ
>> 
>>> On Sep 21, 2019, at 11:54 AM, Ted Dunning  wrote:
>>> 
>>> 
>>> I would suggest that using an epoch number stored in ZK might be helpful. 
>>> Every operation that the master takes could be made conditional on the 
>>> epoch number using a multi-transaction.
>>> 
>>> Unfortunately, as you say, you have to have the update of the epoch be 
>>> atomic with becoming leader. 
>>> 
>>> The natural way to do this is to have an update of an epoch file be part of 
>>> the leader election, but that probably isn't possible using Curator. The 
>>> way I would tend to do it would be have a persistent file that is updated 
>>> atomically as part of leader election. The version of that persistent file 
>>> could then be used as the epoch 

Re: Leader election and leader operation based on zookeeper

2019-09-29 Thread Zili Chen
Hi Jordan,

Here is a possible edge case of coordination node way.


- When an instance becomes leader it:
  - Gets the version of the coordination ZNode
  - Sets the data for that ZNode (the contents don't matter) using the
    retrieved version number
  - If the set succeeds you can be assured you are currently leader
    (otherwise release leadership and re-contend)
  - Save the new version
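
For reference, a minimal Java/Curator sketch of that check-and-set procedure
(the class and path names are illustrative assumptions, not the actual
implementation; error handling is omitted):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.zookeeper.data.Stat;

    // Claim leadership with a conditional write on a pre-created persistent
    // coordination znode, and remember the resulting version as the "epoch".
    final class CoordinationNodeGuard {
        static int claimLeadership(CuratorFramework client, String coordinationPath)
                throws Exception {
            Stat stat = new Stat();
            client.getData().storingStatIn(stat).forPath(coordinationPath);
            // Fails with KeeperException.BadVersionException if another instance
            // has bumped the version since we read it.
            Stat afterClaim = client.setData()
                    .withVersion(stat.getVersion())
                    .forPath(coordinationPath, new byte[0]);
            return afterClaim.getVersion(); // save this and gate later writes on it
        }
    }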


Actually, it is NOT atomic for an instance to become leader and then get the
version of the coordination znode. So an edge case is:

1. instance-1 becomes leader, trying to get the version of the coordination
znode.
2. instance-2 becomes leader, update the coordination znode.
3. instance-1 gets the newer version and re-update the coordination znode.

Generally speaking instance-1 has suffered a session expire here, but since
Curator retries on session expire the case above is possible. Although
instance-2 will be misled into believing it is not the leader and will give up
leadership so that the algorithm can proceed, and instance-1 will be
asynchronously notified that it is not the leader, instance-1 may already have
performed some operations before that notification arrives.

Curator should ensure, with some synchronization logic, that instance-1 will
not regard itself as the leader. Or just use a cached leader latch path for
the check, because the latch path at the moment it becomes leader is
synchronized to be the exact one. To be more clear, by leader latch path I
don't mean the volatile field, but the path cached at the moment it becomes
leader.

Best,
tison.


On Sun, Sep 22, 2019 at 2:43 AM Zili Chen  wrote:

> >the Curator recipes delete and recreate their paths
>
> However, as mentioned above, we do a one-shot election (it doesn't reuse the
> Curator recipe), so the latch path we check is always the path from the epoch
> in which the contender became leader. You can check out an implementation of
> the design here[1]. Even if we want to enable re-contending we can set a guard
>
> (change state -> track latch path)
>
> and check that the state is LEADING && the path exists (so we don't get misled
> and check a wrong path).
>
> Checking the version of a coordination znode sounds like another valid
> solution. I'm glad to see it in a future Curator version, and if there is a
> ticket I can help dig into it a bit :-)
>
> Best,
> tison.
>
> [1]
> https://github.com/TisonKun/flink/blob/ad51edbfccd417be1b5a1f136e81b0b77401c43a/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionServiceNG.java
>
>
> On Sun, Sep 22, 2019 at 2:31 AM Jordan Zimmerman  wrote:
>
>> The issue is that the leader path doesn't stay constant. Every time there
>> is a network partition, etc. the Curator recipes delete and recreate their
>> paths. So, I'm concerned that client code trying to keep track of the
>> leader path would be error prone (it's one reason that they aren't public -
>> it's volatile internal state).
>>
>> -Jordan
>>
>> On Sep 21, 2019, at 1:26 PM, Zili Chen  wrote:
>>
>> Hi Jordan,
>>
>> >I think using the leader path may not work
>>
>> could you share a situation where this strategy does not work? In the design
>> we do leader contending one-shot, and when performing a transaction we check
>> the existence of the latch path && that we are in state LEADING.
>>
>> Given that the election algorithm works, the state transitions to LEADING only
>> once its latch path has become the smallest sequential znode. So the existence
>> of the latch path guards against anybody else having become leader.
>>
>>
>> On Sun, Sep 22, 2019 at 12:58 AM Jordan Zimmerman  wrote:
>>
>>> Yeah, Ted - I think this is basically the same thing. We should all try
>>> to poke holes in this.
>>>
>>> -JZ
>>>
>>> On Sep 21, 2019, at 11:54 AM, Ted Dunning  wrote:
>>>
>>>
>>> I would suggest that using an epoch number stored in ZK might be
>>> helpful. Every operation that the master takes could be made conditional on
>>> the epoch number using a multi-transaction.
>>>
>>> Unfortunately, as you say, you have to have the update of the epoch be
>>> atomic with becoming leader.
>>>
>>> The natural way to do this is to have an update of an epoch file be part
>>> of the leader election, but that probably isn't possible using Curator. The
>>> way I would tend to do it would be have a persistent file that is updated
>>> atomically as part of leader election. The version of that persistent file
>>> could then be used as the epoch number. All updates to files that are gated
>>> on the epoch number would only proceed if no other master has been elected,
>>> at least if you use the sync option.
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Sep 20, 2019 at 1:31 AM Zili Chen  wrote:
>>>
 Hi ZooKeepers,

 Recently there is an ongoing refactor[1] in Flink community aimed at
 overcoming several inconsistent state issues on ZK we have met. I come
 here to share our design of leader election and leader operation. By leader
 operation, we mean an operation that should be committed only if the
 contender is the leader. Also CC Curator mailing list because it 

Re: Leader election and leader operation based on zookeeper

2019-09-21 Thread Zili Chen
>the Curator recipes delete and recreate their paths

However, as mentioned above, we do a one-shot election (it doesn't reuse the
Curator recipe), so the latch path we check is always the path from the epoch
in which the contender became leader. You can check out an implementation of
the design here[1]. Even if we want to enable re-contending we can set a guard

(change state -> track latch path)

and check that the state is LEADING && the path exists (so we don't get misled
and check a wrong path).

Checking the version of a coordination znode sounds like another valid
solution. I'm glad to see it in a future Curator version, and if there is a
ticket I can help dig into it a bit :-)

Best,
tison.

[1]
https://github.com/TisonKun/flink/blob/ad51edbfccd417be1b5a1f136e81b0b77401c43a/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionServiceNG.java


On Sun, Sep 22, 2019 at 2:31 AM Jordan Zimmerman  wrote:

> The issue is that the leader path doesn't stay constant. Every time there
> is a network partition, etc. the Curator recipes delete and recreate their
> paths. So, I'm concerned that client code trying to keep track of the
> leader path would be error prone (it's one reason that they aren't public -
> it's volatile internal state).
>
> -Jordan
>
> On Sep 21, 2019, at 1:26 PM, Zili Chen  wrote:
>
> Hi Jordan,
>
> >I think using the leader path may not work
>
> could you share a situation where this strategy does not work? In the design
> we do leader contending one-shot, and when performing a transaction we check
> the existence of the latch path && that we are in state LEADING.
>
> Given that the election algorithm works, the state transitions to LEADING only
> once its latch path has become the smallest sequential znode. So the existence
> of the latch path guards against anybody else having become leader.
>
>
> On Sun, Sep 22, 2019 at 12:58 AM Jordan Zimmerman  wrote:
>
>> Yeah, Ted - I think this is basically the same thing. We should all try
>> to poke holes in this.
>>
>> -JZ
>>
>> On Sep 21, 2019, at 11:54 AM, Ted Dunning  wrote:
>>
>>
>> I would suggest that using an epoch number stored in ZK might be helpful.
>> Every operation that the master takes could be made conditional on the
>> epoch number using a multi-transaction.
>>
>> Unfortunately, as you say, you have to have the update of the epoch be
>> atomic with becoming leader.
>>
>> The natural way to do this is to have an update of an epoch file be part
>> of the leader election, but that probably isn't possible using Curator. The
>> way I would tend to do it would be have a persistent file that is updated
>> atomically as part of leader election. The version of that persistent file
>> could then be used as the epoch number. All updates to files that are gated
>> on the epoch number would only proceed if no other master has been elected,
>> at least if you use the sync option.
>>
>>
>>
>>
>>
>> On Fri, Sep 20, 2019 at 1:31 AM Zili Chen  wrote:
>>
>>> Hi ZooKeepers,
>>>
>>> Recently there is an ongoing refactor[1] in Flink community aimed at
>>> overcoming several inconsistent state issues on ZK we have met. I come
>>> here to share our design of leader election and leader operation. By leader
>>> operation, we mean an operation that should be committed only if the
>>> contender is the leader. Also CC Curator mailing list because it also
>>> contains the reason why we cannot JUST use Curator.
>>>
>>> The rule we want to keep is
>>>
>>> **Writes on ZK must be committed only if the contender is the leader**
>>>
>>> We represent contender by an individual ZK client. At the moment we use
>>> Curator for leader election so the algorithm is the same as the
>>> optimized version in this page[2].
>>>
>>> The problem is that this algorithm only takes care of leader election but
>>> is indifferent to subsequent operations. Consider the scenario below:
>>>
>>> 1. contender-1 becomes the leader
>>> 2. contender-1 proposes a create txn-1
>>> 3. sender thread suspended for full gc
>>> 4. contender-1 lost leadership and contender-2 becomes the leader
>>> 5. contender-1 recovers from full gc, before it reacts to revoke
>>> leadership event, txn-1 retried and sent to ZK.
>>>
>>> Without any other guard the txn will succeed on ZK, and thus contender-1
>>> commits a write operation even though it is no longer the leader. This issue is
>>> also documented in this note[3].
>>>
>>> To overcome this issue instead of just saying that we're unfortunate,
>>> we draft two possible solutions.
>>>
>>> The first is documented here[4]. Briefly, when the contender becomes the
>>> leader, we memorize the latch path at that moment. And for
>>> subsequent operations, we first check, within the same transaction, the
>>> existence of the latch path. Leadership is only switched if the latch is
>>> gone, and all operations will fail once the latch is gone.
>>>
>>> The second is still rough. Basically it relies on session expire
>>> mechanism in ZK. We will adopt the unoptimized version in the
>>> recipe[2] given that in our scenario there are only few contenders

Re: Leader election and leader operation based on zookeeper

2019-09-21 Thread Jordan Zimmerman
Thinking more about this... I imagine this works if the current leader path is 
always used. I need to think about this some more. 

-JZ

> On Sep 21, 2019, at 1:31 PM, Jordan Zimmerman  
> wrote:
> 
> The issue is that the leader path doesn't stay constant. Every time there is 
> a network partition, etc. the Curator recipes delete and recreate their 
> paths. So, I'm concerned that client code trying to keep track of the leader 
> path would be error prone (it's one reason that they aren't public - it's 
> volatile internal state).
> 
> -Jordan
> 
>> On Sep 21, 2019, at 1:26 PM, Zili Chen  wrote:
>> 
>> Hi Jordan,
>> 
>> >I think using the leader path may not work
>> 
>> could you share a situation where this strategy does not work? In the design 
>> we do leader contending one-shot, and when performing a transaction we check 
>> the existence of the latch path && that we are in state LEADING.
>> 
>> Given that the election algorithm works, the state transitions to LEADING only 
>> once its latch path has become the smallest sequential znode. So the existence 
>> of the latch path guards against anybody else having become leader.
>> 
>> 
>> On Sun, Sep 22, 2019 at 12:58 AM Jordan Zimmerman  wrote:
>> Yeah, Ted - I think this is basically the same thing. We should all try to 
>> poke holes in this.
>> 
>> -JZ
>> 
>>> On Sep 21, 2019, at 11:54 AM, Ted Dunning  wrote:
>>> 
>>> 
>>> I would suggest that using an epoch number stored in ZK might be helpful. 
>>> Every operation that the master takes could be made conditional on the 
>>> epoch number using a multi-transaction.
>>> 
>>> Unfortunately, as you say, you have to have the update of the epoch be 
>>> atomic with becoming leader. 
>>> 
>>> The natural way to do this is to have an update of an epoch file be part of 
>>> the leader election, but that probably isn't possible using Curator. The 
>>> way I would tend to do it would be have a persistent file that is updated 
>>> atomically as part of leader election. The version of that persistent file 
>>> could then be used as the epoch number. All updates to files that are gated 
>>> on the epoch number would only proceed if no other master has been elected, 
>>> at least if you use the sync option.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Fri, Sep 20, 2019 at 1:31 AM Zili Chen  wrote:
>>> Hi ZooKeepers,
>>> 
>>> Recently there is an ongoing refactor[1] in Flink community aimed at
>>> overcoming several inconsistent state issues on ZK we have met. I come
>>> here to share our design of leader election and leader operation. By leader
>>> operation, we mean an operation that should be committed only if the
>>> contender is the leader. Also CC Curator mailing list because it also
>>> contains the reason why we cannot JUST use Curator.
>>> 
>>> The rule we want to keep is
>>> 
>>> **Writes on ZK must be committed only if the contender is the leader**
>>> 
>>> We represent contender by an individual ZK client. At the moment we use
>>> Curator for leader election so the algorithm is the same as the
>>> optimized version in this page[2].
>>> 
>>> The problem is that this algorithm only takes care of leader election but
>>> is indifferent to subsequent operations. Consider the scenario below:
>>> 
>>> 1. contender-1 becomes the leader
>>> 2. contender-1 proposes a create txn-1
>>> 3. sender thread suspended for full gc
>>> 4. contender-1 lost leadership and contender-2 becomes the leader
>>> 5. contender-1 recovers from full gc, before it reacts to revoke
>>> leadership event, txn-1 retried and sent to ZK.
>>> 
>>> Without any other guard the txn will succeed on ZK, and thus contender-1
>>> commits a write operation even though it is no longer the leader. This issue is
>>> also documented in this note[3].
>>> 
>>> To overcome this issue instead of just saying that we're unfortunate,
>>> we draft two possible solutions.
>>> 
>>> The first is documented here[4]. Briefly, when the contender becomes the
>>> leader, we memorize the latch path at that moment. And for
>>> subsequent operations, we first check, within the same transaction, the
>>> existence of the latch path. Leadership is only switched if the latch is
>>> gone, and all operations will fail once the latch is gone.
>>> 
>>> The second is still rough. Basically it relies on session expire
>>> mechanism in ZK. We will adopt the unoptimized version in the
>>> recipe[2] given that in our scenario there are only a few contenders
>>> at the same time. Thus we create the /leader node as an ephemeral znode with
>>> leader information, and when the session expires we consider leadership
>>> revoked and terminate the contender. Asynchronous write operations
>>> should not succeed because they will all fail on session expire.
>>> 
>>> We cannot adopt 1 using Curator because it doesn't expose the latch
>>> path (which was added recently, but not in the version we use); we
>>> cannot adopt 2 using Curator because although we have to retry on
>>> 

Re: Leader election and leader operation based on zookeeper

2019-09-21 Thread Jordan Zimmerman
The issue is that the leader path doesn't stay constant. Every time there is a 
network partition, etc. the Curator recipes delete and recreate their paths. 
So, I'm concerned that client code trying to keep track of the leader path 
would be error prone (it's one reason that they aren't public - it's volatile 
internal state).

-Jordan

> On Sep 21, 2019, at 1:26 PM, Zili Chen  wrote:
> 
> Hi Jordan,
> 
> >I think using the leader path may not work
> 
> could you share a situation where this strategy does not work? In the design 
> we do leader contending one-shot, and when performing a transaction we check 
> the existence of the latch path && that we are in state LEADING.
> 
> Given that the election algorithm works, the state transitions to LEADING only 
> once its latch path has become the smallest sequential znode. So the existence 
> of the latch path guards against anybody else having become leader.
> 
> 
> On Sun, Sep 22, 2019 at 12:58 AM Jordan Zimmerman  wrote:
> Yeah, Ted - I think this is basically the same thing. We should all try to 
> poke holes in this.
> 
> -JZ
> 
>> On Sep 21, 2019, at 11:54 AM, Ted Dunning  wrote:
>> 
>> 
>> I would suggest that using an epoch number stored in ZK might be helpful. 
>> Every operation that the master takes could be made conditional on the epoch 
>> number using a multi-transaction.
>> 
>> Unfortunately, as you say, you have to have the update of the epoch be 
>> atomic with becoming leader. 
>> 
>> The natural way to do this is to have an update of an epoch file be part of 
>> the leader election, but that probably isn't possible using Curator. The way 
>> I would tend to do it would be have a persistent file that is updated 
>> atomically as part of leader election. The version of that persistent file 
>> could then be used as the epoch number. All updates to files that are gated 
>> on the epoch number would only proceed if no other master has been elected, 
>> at least if you use the sync option.
>> 
>> 
>> 
>> 
>> 
>> On Fri, Sep 20, 2019 at 1:31 AM Zili Chen  wrote:
>> Hi ZooKeepers,
>> 
>> Recently there is an ongoing refactor[1] in Flink community aimed at
>> overcoming several inconsistent state issues on ZK we have met. I come
>> here to share our design of leader election and leader operation. By leader
>> operation, we mean an operation that should be committed only if the
>> contender is the leader. Also CC Curator mailing list because it also
>> contains the reason why we cannot JUST use Curator.
>> 
>> The rule we want to keep is
>> 
>> **Writes on ZK must be committed only if the contender is the leader**
>> 
>> We represent contender by an individual ZK client. At the moment we use
>> Curator for leader election so the algorithm is the same as the
>> optimized version in this page[2].
>> 
>> The problem is that this algorithm only takes care of leader election but
>> is indifferent to subsequent operations. Consider the scenario below:
>> 
>> 1. contender-1 becomes the leader
>> 2. contender-1 proposes a create txn-1
>> 3. sender thread suspended for full gc
>> 4. contender-1 lost leadership and contender-2 becomes the leader
>> 5. contender-1 recovers from full gc, before it reacts to revoke
>> leadership event, txn-1 retried and sent to ZK.
>> 
>> Without any other guard the txn will succeed on ZK, and thus contender-1
>> commits a write operation even though it is no longer the leader. This issue is
>> also documented in this note[3].
>> 
>> To overcome this issue instead of just saying that we're unfortunate,
>> we draft two possible solutions.
>> 
>> The first is documented here[4]. Briefly, when the contender becomes the
>> leader, we memorize the latch path at that moment. And for
>> subsequent operations, we first check, within the same transaction, the
>> existence of the latch path. Leadership is only switched if the latch is
>> gone, and all operations will fail once the latch is gone.
>> 
>> The second is still rough. Basically it relies on session expire
>> mechanism in ZK. We will adopt the unoptimized version in the
>> recipe[2] given that in our scenario there are only a few contenders
>> at the same time. Thus we create the /leader node as an ephemeral znode with
>> leader information, and when the session expires we consider leadership
>> revoked and terminate the contender. Asynchronous write operations
>> should not succeed because they will all fail on session expire.
>> 
>> We cannot adopt 1 using Curator because it doesn't expose the latch
>> path (which was added recently, but not in the version we use); we
>> cannot adopt 2 using Curator because, although we have to retry on
>> connection loss, we don't want to retry on session expire. Curator
>> always creates a new client on session expire and retries the operation.
>> 
>> I'd like to learn from ZooKeeper community that 1. is there any
>> potential risk if we eventually adopt option 1 or option 2? 2. is
>> there any other solution we can adopt?
>> 
>> Best,
>> tison.

Re: Leader election and leader operation based on zookeeper

2019-09-20 Thread Zili Chen
Hi Jordan,

Thanks for pointing that out. However, I'm not clear about the lock strategy of
Curator.

Is it possible that getZookeeperClient().getZooKeeper() runs concurrently with a
session expire and a re-instanced ZK client (so that I get the wrong session
id)?

Furthermore, even if I get the session id, check that it is the same as when I
was granted leadership, and then perform the operation, Curator may still retry
the operation when it fails on session expire.

Best,
tison.


On Sat, Sep 21, 2019 at 11:27 AM Jordan Zimmerman  wrote:

> It seems Curator does not expose session id
>
>
> you can always access the ZooKeeper handle directly to get the session ID:
>
> CuratorFramework curator = ...
> curator.getZookeeperClient().getZooKeeper()
>
> -JZ
>
> On Sep 20, 2019, at 10:21 PM, Zili Chen  wrote:
>
> >>I am assuming the "write operation" here is write to ZooKeeper
>
> Yes.
>
> >>Looks like contender-1 was not reusing same ZooKeeper client object, so
> this explains how the previous supposed to be fail operation succeeds?
>
> Yes. Our communication with ZK is based on Curator, which will re-instance a
> client and retry the operation. Because the retry is scheduled asynchronously,
> this erroneous execution order is possible.
>
> >>record the session ID and don't commit any write operations if session
> ID changes.
>
> Sounds reasonable. Currently in our ongoing design we treat the latch path
> as the "session id", so we use a multi-op to atomically verify it.
> It seems Curator does not expose the session id. And for my option 2 above I
> am even thinking of falling back to plain ZooKeeper, so that we just fail on
> session expire and re-instance another contender, contending for
> leadership. This would save us from maintaining mutable state during
> a leadership epoch (to be clear, Flink-scope leadership, not ZK).
>
> Best,
> tison.
>
>
On Sat, Sep 21, 2019 at 4:03 AM Michael Han  wrote:
>
>> >> thus contender-1 commit a write operation even if it is no longer the
>> leader
>>
>> I am assuming the "write operation" here is write to ZooKeeper (as
>> opposed to write to an external storage system)? If so:
>>
>> >> contender-1 recovers from full gc, before it reacts to revoke
>> leadership event, txn-1 retried and sent to ZK.
>>
>> contender-2 becomes the leader implies that the ephemeral node appertains
>> to contender-1 has been removed, which further implies that the session
>> appertains to contender-1 is either explicitly closed (by client), or
>> expired. So if contender-1 was still using same client ZooKeeper object,
>> then it's not possible for txn-1 succeeded as session expire was an event
>> ordered prior to txn-1, which wouldn't commit after an expired session.
>>
>> >> Curator always creates a new client on session expire and retry the
>> operation.
>> Looks like contender-1 was not reusing same ZooKeeper client object, so
>> this explains how the previous supposed to be fail operation succeeds?
>>
>> If my reasoning make sense, one idea might be on Flink side, once you
>> finish leader election with ZK, record the session ID and don't commit any
>> write operations if session ID changes.
>>
>> The fencing token + multi might also work, but that sounds a little bit
>> heavier.
>>
>> On Fri, Sep 20, 2019 at 1:31 AM Zili Chen  wrote:
>>
>>> Hi ZooKeepers,
>>>
>>> Recently there is an ongoing refactor[1] in Flink community aimed at
>>> overcoming several inconsistent state issues on ZK we have met. I come
>>> here to share our design of leader election and leader operation. By leader
>>> operation, we mean an operation that should be committed only if the
>>> contender is the leader. Also CC Curator mailing list because it also
>>> contains the reason why we cannot JUST use Curator.
>>>
>>> The rule we want to keep is
>>>
>>> **Writes on ZK must be committed only if the contender is the leader**
>>>
>>> We represent contender by an individual ZK client. At the moment we use
>>> Curator for leader election so the algorithm is the same as the
>>> optimized version in this page[2].
>>>
>>> The problem is that this algorithm only takes care of leader election but
>>> is indifferent to subsequent operations. Consider the scenario below:
>>>
>>> 1. contender-1 becomes the leader
>>> 2. contender-1 proposes a create txn-1
>>> 3. sender thread suspended for full gc
>>> 4. contender-1 lost leadership and contender-2 becomes the leader
>>> 5. contender-1 recovers from full gc, before it reacts to revoke
>>> leadership event, txn-1 retried and sent to ZK.
>>>
>>> Without any other guard the txn will succeed on ZK, and thus contender-1
>>> commits a write operation even though it is no longer the leader. This issue is
>>> also documented in this note[3].
>>>
>>> To overcome this issue instead of just saying that we're unfortunate,
>>> we draft two possible solutions.
>>>
>>> The first is documented here[4]. Briefly, when the contender becomes the
>>> leader, we memorize the latch path at that moment. And for
>>> subsequent operations, we first check, within the same transaction, the
>>> existence of the latch path. Leadership is only switched 

Re: Leader election and leader operation based on zookeeper

2019-09-20 Thread Jordan Zimmerman
> It seems Curator does not expose session id

you can always access the ZooKeeper handle directly to get the session ID:

CuratorFramework curator = ...
curator.getZookeeperClient().getZooKeeper()
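
For example, the session id itself can then be read from that handle (just a
sketch; getZooKeeper() may throw):

    long sessionId = curator.getZookeeperClient().getZooKeeper().getSessionId();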

-JZ

> On Sep 20, 2019, at 10:21 PM, Zili Chen  wrote:
> 
> >>I am assuming the "write operation" here is write to ZooKeeper
> 
> Yes.
> 
> >>Looks like contender-1 was not reusing same ZooKeeper client object, so 
> >>this explains how the previous supposed to be fail operation succeeds?
> 
> Yes. Our communication with ZK is based on Curator, which will re-instance a
> client and retry the operation. Because the retry is scheduled asynchronously,
> this erroneous execution order is possible.
> 
> >>record the session ID and don't commit any write operations if session ID 
> >>changes.
> 
> Sounds reasonable. Currently in our ongoing design we treat the latch path
> as the "session id", so we use a multi-op to atomically verify it.
> It seems Curator does not expose the session id. And for my option 2 above I
> am even thinking of falling back to plain ZooKeeper, so that we just fail on
> session expire and re-instance another contender, contending for
> leadership. This would save us from maintaining mutable state during
> a leadership epoch (to be clear, Flink-scope leadership, not ZK).
> 
> Best,
> tison.
> 
> 
> On Sat, Sep 21, 2019 at 4:03 AM Michael Han <h...@apache.org> wrote:
> >> thus contender-1 commit a write operation even if it is no longer the 
> >> leader
> 
> I am assuming the "write operation" here is write to ZooKeeper (as opposed to 
> write to an external storage system)? If so:
> 
> >> contender-1 recovers from full gc, before it reacts to revoke leadership 
> >> event, txn-1 retried and sent to ZK.
> 
> contender-2 becomes the leader implies that the ephemeral node appertains to 
> contender-1 has been removed, which further implies that the session 
> appertains to contender-1 is either explicitly closed (by client), or 
> expired. So if contender-1 was still using same client ZooKeeper object, then 
> it's not possible for txn-1 succeeded as session expire was an event ordered 
> prior to txn-1, which wouldn't commit after an expired session.
> 
> >> Curator always creates a new client on session expire and retry the 
> >> operation.
> Looks like contender-1 was not reusing same ZooKeeper client object, so this 
> explains how the previous supposed to be fail operation succeeds?
> 
> If my reasoning make sense, one idea might be on Flink side, once you finish 
> leader election with ZK, record the session ID and don't commit any write 
> operations if session ID changes.
> 
> The fencing token + multi might also work, but that sounds a little bit 
> heavier. 
> 
> On Fri, Sep 20, 2019 at 1:31 AM Zili Chen  wrote:
> Hi ZooKeepers,
> 
> Recently there is an ongoing refactor[1] in Flink community aimed at
> overcoming several inconsistent state issues on ZK we have met. I come
> here to share our design of leader election and leader operation. By leader
> operation, we mean an operation that should be committed only if the
> contender is the leader. Also CC Curator mailing list because it also
> contains the reason why we cannot JUST use Curator.
> 
> The rule we want to keep is
> 
> **Writes on ZK must be committed only if the contender is the leader**
> 
> We represent contender by an individual ZK client. At the moment we use
> Curator for leader election so the algorithm is the same as the
> optimized version in this page[2].
> 
> The problem is that this algorithm only takes care of leader election but
> is indifferent to subsequent operations. Consider the scenario below:
> 
> 1. contender-1 becomes the leader
> 2. contender-1 proposes a create txn-1
> 3. sender thread suspended for full gc
> 4. contender-1 lost leadership and contender-2 becomes the leader
> 5. contender-1 recovers from full gc, before it reacts to revoke
> leadership event, txn-1 retried and sent to ZK.
> 
> Without any other guard the txn will succeed on ZK, and thus contender-1
> commits a write operation even though it is no longer the leader. This issue is
> also documented in this note[3].
> 
> To overcome this issue instead of just saying that we're unfortunate,
> we draft two possible solutions.
> 
> The first is documented here[4]. Briefly, when the contender becomes the
> leader, we memorize the latch path at that moment. And for
> subsequent operations, we first check, within the same transaction, the
> existence of the latch path. Leadership is only switched if the latch is
> gone, and all operations will fail once the latch is gone.
> 
> The second is still rough. Basically it relies on session expire
> mechanism in ZK. We will adopt the unoptimized version in the
> recipe[2] given that in our scenario there are only a few contenders
> at the same time. Thus we create the /leader node as an ephemeral znode with
> leader information, and when the session expires we consider leadership
> revoked and terminate the contender. Asynchronous write operations
> should not succeed 

Re: Leader election and leader operation based on zookeeper

2019-09-20 Thread Zili Chen
>>I am assuming the "write operation" here is write to ZooKeeper

Yes.

>>Looks like contender-1 was not reusing same ZooKeeper client object, so
this explains how the previous supposed to be fail operation succeeds?

Yes. Our communication with ZK is based on Curator, which will re-instance a
client and retry the operation. Because the retry is scheduled asynchronously,
this erroneous execution order is possible.

>>record the session ID and don't commit any write operations if session ID
changes.

Sounds reasonable. Currently in our ongoing design we treat the latch path
as the "session id", so we use a multi-op to atomically verify it.
It seems Curator does not expose the session id. And for my option 2 above I
am even thinking of falling back to plain ZooKeeper, so that we just fail on
session expire and re-instance another contender, contending for
leadership. This would save us from maintaining mutable state during
a leadership epoch (to be clear, Flink-scope leadership, not ZK).
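
For what it's worth, a rough sketch of that fallback to plain ZooKeeper (the
names and the much-simplified election are assumptions, not the actual design;
error handling omitted):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // One plain ZooKeeper client per contender; treat session expiration as
    // loss of leadership and terminate the contender instead of retrying.
    final class EphemeralLeaderSketch implements Watcher {
        private ZooKeeper zk;

        void contend(String connectString, byte[] leaderInfo) throws Exception {
            zk = new ZooKeeper(connectString, 60_000, this);
            // Succeeds only if no other contender currently holds /leader
            // (otherwise it throws KeeperException.NodeExistsException).
            zk.create("/leader", leaderInfo,
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getState() == Event.KeeperState.Expired) {
                // Leadership revoked: pending async writes fail with
                // SESSIONEXPIRED, and we terminate this contender rather
                // than re-create the client.
                terminateContender();
            }
        }

        private void terminateContender() {
            // shut down this contender; a fresh instance re-contends later
        }
    }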

Best,
tison.


On Sat, Sep 21, 2019 at 4:03 AM Michael Han  wrote:

> >> thus contender-1 commit a write operation even if it is no longer the
> leader
>
> I am assuming the "write operation" here is write to ZooKeeper (as opposed
> to write to an external storage system)? If so:
>
> >> contender-1 recovers from full gc, before it reacts to revoke
> leadership event, txn-1 retried and sent to ZK.
>
> contender-2 becomes the leader implies that the ephemeral node appertains
> to contender-1 has been removed, which further implies that the session
> appertains to contender-1 is either explicitly closed (by client), or
> expired. So if contender-1 was still using same client ZooKeeper object,
> then it's not possible for txn-1 succeeded as session expire was an event
> ordered prior to txn-1, which wouldn't commit after an expired session.
>
> >> Curator always creates a new client on session expire and retry the
> operation.
> Looks like contender-1 was not reusing same ZooKeeper client object, so
> this explains how the previous supposed to be fail operation succeeds?
>
> If my reasoning make sense, one idea might be on Flink side, once you
> finish leader election with ZK, record the session ID and don't commit any
> write operations if session ID changes.
>
> The fencing token + multi might also work, but that sounds a little bit
> heavier.
>
> On Fri, Sep 20, 2019 at 1:31 AM Zili Chen  wrote:
>
>> Hi ZooKeepers,
>>
>> Recently there is an ongoing refactor[1] in Flink community aimed at
>> overcoming several inconsistent state issues on ZK we have met. I come
>> here to share our design of leader election and leader operation. By leader
>> operation, we mean an operation that should be committed only if the
>> contender is the leader. Also CC Curator mailing list because it also
>> contains the reason why we cannot JUST use Curator.
>>
>> The rule we want to keep is
>>
>> **Writes on ZK must be committed only if the contender is the leader**
>>
>> We represent contender by an individual ZK client. At the moment we use
>> Curator for leader election so the algorithm is the same as the
>> optimized version in this page[2].
>>
>> The problem is that this algorithm only takes care of leader election but
>> is indifferent to subsequent operations. Consider the scenario below:
>>
>> 1. contender-1 becomes the leader
>> 2. contender-1 proposes a create txn-1
>> 3. sender thread suspended for full gc
>> 4. contender-1 lost leadership and contender-2 becomes the leader
>> 5. contender-1 recovers from full gc, before it reacts to revoke
>> leadership event, txn-1 retried and sent to ZK.
>>
>> Without any other guard the txn will succeed on ZK, and thus contender-1
>> commits a write operation even though it is no longer the leader. This issue is
>> also documented in this note[3].
>>
>> To overcome this issue instead of just saying that we're unfortunate,
>> we draft two possible solutions.
>>
>> The first is documented here[4]. Briefly, when the contender becomes the
>> leader, we memorize the latch path at that moment. And for
>> subsequent operations, we first check, within the same transaction, the
>> existence of the latch path. Leadership is only switched if the latch is
>> gone, and all operations will fail once the latch is gone.
>>
>> The second is still rough. Basically it relies on session expire
>> mechanism in ZK. We will adopt the unoptimized version in the
>> recipe[2] given that in our scenario there are only a few contenders
>> at the same time. Thus we create the /leader node as an ephemeral znode with
>> leader information, and when the session expires we consider leadership
>> revoked and terminate the contender. Asynchronous write operations
>> should not succeed because they will all fail on session expire.
>>
>> We cannot adopt 1 using Curator because it doesn't expose the latch
>> path (which was added recently, but not in the version we use); we
>> cannot adopt 2 using Curator because, although we have to retry on
>> connection loss, we don't want to retry on session expire. Curator
>> always creates a new client on session 

Re: Leader election and leader operation based on zookeeper

2019-09-20 Thread Michael Han
>> thus contender-1 commit a write operation even if it is no longer the
leader

I am assuming the "write operation" here is write to ZooKeeper (as opposed
to write to an external storage system)? If so:

>> contender-1 recovers from full gc, before it reacts to revoke leadership
event, txn-1 retried and sent to ZK.

contender-2 becoming the leader implies that the ephemeral node appertaining
to contender-1 has been removed, which further implies that the session
appertaining to contender-1 was either explicitly closed (by the client) or
expired. So if contender-1 was still using the same ZooKeeper client object,
then it's not possible for txn-1 to have succeeded, as the session expire was
an event ordered prior to txn-1, which wouldn't commit after an expired session.

>> Curator always creates a new client on session expire and retry the
operation.
Looks like contender-1 was not reusing the same ZooKeeper client object, so
this explains how the operation that was supposed to fail succeeds?

If my reasoning makes sense, one idea might be, on the Flink side, to record
the session ID once you finish leader election with ZK and not commit any
write operations if the session ID changes.
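
A rough sketch of that idea (illustrative names only; getZooKeeper() may throw,
and note that this check alone is not atomic with the write):

    import org.apache.curator.framework.CuratorFramework;

    final class SessionGuard {
        private final CuratorFramework client;
        private final long leaderSessionId; // recorded when leadership was granted

        SessionGuard(CuratorFramework client) throws Exception {
            this.client = client;
            this.leaderSessionId =
                    client.getZookeeperClient().getZooKeeper().getSessionId();
        }

        // Call before committing a leader write; a changed session id means the
        // session that backed the leadership is gone.
        boolean stillSameSession() throws Exception {
            long current = client.getZookeeperClient().getZooKeeper().getSessionId();
            return current == leaderSessionId;
        }
    }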

The fencing token + multi might also work, but that sounds a little bit
heavier.

On Fri, Sep 20, 2019 at 1:31 AM Zili Chen  wrote:

> Hi ZooKeepers,
>
> Recently there is an ongoing refactor[1] in Flink community aimed at
> overcoming several inconsistent state issues on ZK we have met. I come
> here to share our design of leader election and leader operation. By leader
> operation, we mean an operation that should be committed only if the
> contender is the leader. Also CC Curator mailing list because it also
> contains the reason why we cannot JUST use Curator.
>
> The rule we want to keep is
>
> **Writes on ZK must be committed only if the contender is the leader**
>
> We represent contender by an individual ZK client. At the moment we use
> Curator for leader election so the algorithm is the same as the
> optimized version in this page[2].
>
> The problem is that this algorithm only takes care of leader election but
> is indifferent to subsequent operations. Consider the scenario below:
>
> 1. contender-1 becomes the leader
> 2. contender-1 proposes a create txn-1
> 3. sender thread suspended for full gc
> 4. contender-1 lost leadership and contender-2 becomes the leader
> 5. contender-1 recovers from full gc, before it reacts to revoke
> leadership event, txn-1 retried and sent to ZK.
>
> Without any other guard the txn will succeed on ZK, and thus contender-1
> commits a write operation even though it is no longer the leader. This issue is
> also documented in this note[3].
>
> To overcome this issue instead of just saying that we're unfortunate,
> we draft two possible solutions.
>
> The first is documented here[4]. Briefly, when the contender becomes the
> leader, we memorize the latch path at that moment. And for
> subsequent operations, we first check, within the same transaction, the
> existence of the latch path. Leadership is only switched if the latch is
> gone, and all operations will fail once the latch is gone.
>
> The second is still rough. Basically it relies on session expire
> mechanism in ZK. We will adopt the unoptimized version in the
> recipe[2] given that in our scenario there are only a few contenders
> at the same time. Thus we create the /leader node as an ephemeral znode with
> leader information, and when the session expires we consider leadership
> revoked and terminate the contender. Asynchronous write operations
> should not succeed because they will all fail on session expire.
>
> We cannot adopt 1 using Curator because it doesn't expose the latch
> path (which was added recently, but not in the version we use); we
> cannot adopt 2 using Curator because, although we have to retry on
> connection loss, we don't want to retry on session expire. Curator
> always creates a new client on session expire and retries the operation.
>
> I'd like to learn from ZooKeeper community that 1. is there any
> potential risk if we eventually adopt option 1 or option 2? 2. is
> there any other solution we can adopt?
>
> Best,
> tison.
>
> [1] https://issues.apache.org/jira/browse/FLINK-10333
> [2]
> https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection
> [3] https://cwiki.apache.org/confluence/display/CURATOR/TN10
> [4]
>
> https://docs.google.com/document/d/1cBY1t0k5g1xNqzyfZby3LcPu4t-wpx57G1xf-nmWrCo/edit?usp=sharing
>