Yeah, Ted - I think this is basically the same thing. We should all try to poke 
holes in this.

-JZ

> On Sep 21, 2019, at 11:54 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> 
> 
> I would suggest that using an epoch number stored in ZK might be helpful. 
> Every operation that the master takes could be made conditional on the epoch 
> number using a multi-transaction.
> 
> Unfortunately, as you say, you have to have the update of the epoch be atomic 
> with becoming leader. 
> 
> The natural way to do this is to have an update of an epoch file be part of 
> the leader election, but that probably isn't possible using Curator. The way 
> I would tend to do it would be have a persistent file that is updated 
> atomically as part of leader election. The version of that persistent file 
> could then be used as the epoch number. All updates to files that are gated 
> on the epoch number would only proceed if no other master has been elected, 
> at least if you use the sync option.
> 
> 
> 
> 
> 
> On Fri, Sep 20, 2019 at 1:31 AM Zili Chen <wander4...@gmail.com 
> <mailto:wander4...@gmail.com>> wrote:
> Hi ZooKeepers,
> 
> Recently there is an ongoing refactor[1] in Flink community aimed at
> overcoming several inconsistent state issues on ZK we have met. I come
> here to share our design of leader election and leader operation. For
> leader operation, it is operation that should be committed only if the
> contender is the leader. Also CC Curator mailing list because it also
> contains the reason why we cannot JUST use Curator.
> 
> The rule we want to keep is
> 
> **Writes on ZK must be committed only if the contender is the leader**
> 
> We represent contender by an individual ZK client. At the moment we use
> Curator for leader election so the algorithm is the same as the
> optimized version in this page[2].
> 
> The problem is that this algorithm only take care of leader election but
> is indifferent to subsequent operations. Consider the scenario below:
> 
> 1. contender-1 becomes the leader
> 2. contender-1 proposes a create txn-1
> 3. sender thread suspended for full gc
> 4. contender-1 lost leadership and contender-2 becomes the leader
> 5. contender-1 recovers from full gc, before it reacts to revoke
> leadership event, txn-1 retried and sent to ZK.
> 
> Without other guard txn will success on ZK and thus contender-1 commit
> a write operation even if it is no longer the leader. This issue is
> also documented in this note[3].
> 
> To overcome this issue instead of just saying that we're unfortunate,
> we draft two possible solution.
> 
> The first is document here[4]. Briefly, when the contender becomes the
> leader, we memorize the latch path at that moment. And for
> subsequent operations, we do in a transaction first checking the
> existence of the latch path. Leadership is only switched if the latch
> gone, and all operations will fail if the latch gone.
> 
> The second is still rough. Basically it relies on session expire
> mechanism in ZK. We will adopt the unoptimized version in the
> recipe[2] given that in our scenario there are only few contenders
> at the same time. Thus we create /leader node as ephemeral znode with
> leader information and when session expired we think leadership is
> revoked and terminate the contender. Asynchronous write operations
> should not succeed because they will all fail on session expire.
> 
> We cannot adopt 1 using Curator because it doesn't expose the latch
> path(which is added recently, but not in the version we use); we
> cannot adopt 2 using Curator because although we have to retry on
> connection loss but we don't want to retry on session expire. Curator
> always creates a new client on session expire and retry the operation.
> 
> I'd like to learn from ZooKeeper community that 1. is there any
> potential risk if we eventually adopt option 1 or option 2? 2. is
> there any other solution we can adopt?
> 
> Best,
> tison.
> 
> [1] https://issues.apache.org/jira/browse/FLINK-10333 
> <https://issues.apache.org/jira/browse/FLINK-10333>
> [2] https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection 
> <https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection>
> [3] https://cwiki.apache.org/confluence/display/CURATOR/TN10 
> <https://cwiki.apache.org/confluence/display/CURATOR/TN10>
> [4] 
> https://docs.google.com/document/d/1cBY1t0k5g1xNqzyfZby3LcPu4t-wpx57G1xf-nmWrCo/edit?usp=sharing
>  
> <https://docs.google.com/document/d/1cBY1t0k5g1xNqzyfZby3LcPu4t-wpx57G1xf-nmWrCo/edit?usp=sharing>
> 

Reply via email to