[ 
https://issues.apache.org/jira/browse/FLINK-14149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zili Chen updated FLINK-14149:
------------------------------
    Description: 
Subsequent to the discussion in FLINK-10333, we reached a consensus to refactor 
the ZK-based storage with a transaction store mechanism. The overall design can 
be found in the design document linked below.

This subtask is aimed at introducing the prerequisite for adopting the 
transaction store, i.e., a new leader election service for the ZK scenario. The 
necessity is that we have to retrieve the corresponding latch path per 
contender, following the algorithm described in FLINK-10333.

Here are the (descriptive) details of the implementation.

We adopt the optimized version of [this 
recipe|https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection][1].
 Code details can be found in [this 
branch|https://github.com/TisonKun/flink/tree/election-service], and the state 
machine can be found in the attached design document. The most important 
difference from the former implementation is:

*Leader election is a one-shot service.*

Specifically, we only create one latch for a specific contender. We tolerate 
{{SUSPENDED}}, a.k.a. {{CONNECTIONLOSS}}, so that the only situation in which 
we lose leadership is session expiration, which implies the ephemeral latch 
znode has been deleted. We don't re-participate as a contender, so after 
{{revokeLeadership}} a contender will never be granted leadership again. This 
is not a problem, but we can do further refactoring on the contender side for 
better behavior.
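
The one-shot semantics above can be sketched in plain Java (no ZooKeeper 
dependency; all names here are illustrative, not the actual Flink API): 
{{SUSPENDED}} is tolerated, only session expiration revokes leadership, and a 
revoked contender never re-participates.

```java
// Hypothetical sketch of the one-shot election semantics described above.
// SUSPENDED (connection loss) is tolerated; only LOST (session expired, so
// the ephemeral latch znode is deleted) revokes leadership for good.
class OneShotElectionSketch {

    enum ConnectionState { CONNECTED, SUSPENDED, LOST }

    private boolean leader = false;
    private boolean revoked = false; // one-shot: set once, never cleared

    /** Called when this contender's latch znode becomes the smallest. */
    void grantLeadership() {
        if (!revoked) {
            leader = true;
        }
    }

    /** Called on ZK connection state changes. */
    void onConnectionStateChanged(ConnectionState state) {
        switch (state) {
            case SUSPENDED:
                // Tolerated: the session may still be alive, so the
                // ephemeral latch znode may still exist; keep leadership.
                break;
            case LOST:
                // Session expired: the latch znode is gone. Revoke
                // leadership permanently; we do not re-participate.
                leader = false;
                revoked = true;
                break;
            case CONNECTED:
                break;
        }
    }

    boolean isLeader() { return leader; }
}
```

In the real service these transitions would be driven by Curator connection 
events; the sketch only shows the state machine's invariant.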

Another topic is the interface. Back to the big picture of FLINK-10333, we 
eventually use a transaction store for persisting the job graph, checkpoints, 
and so on. So a {{getLeaderStore}} method will be added to 
{{LeaderElectionService}}. Because we don't use it at all yet, it is an open 
question whether we add the method to the interface in this subtask, and if 
so, whether we implement it for the other election service implementations.

{{concealLeaderInfo}} is another method that appears in the document; it is 
aimed at cleaning up the leader info node on stop, so it has the same problem 
as {{getLeaderStore}}.
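
For concreteness, here is a hedged sketch of one way the interface question 
could be resolved. {{LeaderStore}} is an assumed placeholder type, and the 
default-method approach is only one option; whether these methods belong on 
the interface in this subtask is exactly the open question above.

```java
// Hypothetical sketch only: none of these names are the settled Flink API.
interface LeaderStore {
    // would expose transactional reads/writes for job graphs,
    // checkpoints, and so on (the FLINK-10333 transaction store)
}

interface LeaderElectionServiceSketch {
    void start();
    void stop();

    // One option: a default that only the ZK implementation overrides,
    // so other election services need no transaction store for now.
    default LeaderStore getLeaderStore() {
        throw new UnsupportedOperationException("no leader store available");
    }

    // Clean up the leader info node when the contender stops;
    // a no-op default keeps other implementations unchanged.
    default void concealLeaderInfo() {
    }
}
```

With defaults like these, only the new ZK service would need real 
implementations, sidestepping changes to the other election services.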

*What we gain*

1. The basics for the overall goal under FLINK-10333.
2. The leader info node can only be modified by the current leader. Thus we 
can remove a lot of concurrency-handling logic in the current ZLES, including 
the use of {{NodeCache}} as well as dealing with the complex stat of the 
ephemeral leader info node.

[1] For other implementations, I started [a 
thread|https://lists.apache.org/x/thread.html/594b66ecb1d60b560a5c4c08ed1b2a67bc29143cb4e8d368da8c39b2@%3Cuser.zookeeper.apache.org%3E]
 in the ZK and Curator communities to discuss. Anyway, this concerns 
implementation details only; the interfaces and semantics should not be 
affected.

  was:
Subsequent to the discussion in FLINK-10333, we reached a consensus to refactor 
the ZK-based storage with a transaction store mechanism. The overall design can 
be found in the design document linked below.

This subtask is aimed at introducing the prerequisite for adopting the 
transaction store, i.e., a new leader election service for the ZK scenario. The 
necessity is that we have to retrieve the corresponding latch path per 
contender, following the algorithm described in FLINK-10333.

Here are the (descriptive) details of the implementation.

We adopt the optimized version of [this 
recipe|https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection][1].
 Code details can be found in [this 
branch|https://github.com/TisonKun/flink/tree/election-service], and the state 
machine can be found in the attached design document. The two most important 
differences from the former implementation are:

(1) *Leader election is a one-shot service.*

Specifically, we only create one latch for a specific contender. We tolerate 
{{SUSPENDED}}, a.k.a. {{CONNECTIONLOSS}}, so that the only situation in which 
we lose leadership is session expiration, which implies the ephemeral latch 
znode has been deleted. We don't re-participate as a contender, so after 
{{revokeLeadership}} a contender will never be granted leadership again. This 
is not a problem, but we can do further refactoring on the contender side for 
better behavior.

(2) *The leader info znode is {{PERSISTENT}}.*

This is because we now regard create/setData on the leader info znode as a 
leader-only operation and thus do it in a transaction. If we kept using an 
ephemeral znode, it would be hard to test: because we share the ZK client, the 
ephemeral znode is not deleted, so we would have to deal with complex znode 
stats that a transaction cannot simply handle. And since the znode is 
{{PERSISTENT}}, we introduce a {{concealLeaderInfo}} method, called back on 
contender stop, to clean up.

Another topic is the interface. Back to the big picture of FLINK-10333, we 
eventually use a transaction store for persisting the job graph, checkpoints, 
and so on. So a {{getLeaderStore}} method will be added to 
{{LeaderElectionService}}. Because we don't use it at all yet, it is an open 
question whether we add the method to the interface in this subtask, and if 
so, whether we implement it for the other election service implementations.

{{concealLeaderInfo}} is another method that appears in the document; it is 
aimed at cleaning up the leader info node on stop, so it has the same problem 
as {{getLeaderStore}}.

*What we gain*

1. The basics for the overall goal under FLINK-10333.
2. The leader info node can only be modified by the current leader. Thus we 
can remove a lot of concurrency-handling logic in the current ZLES, including 
the use of {{NodeCache}} as well as dealing with the complex stat of the 
ephemeral leader info node.

[1] For other implementations, I started [a 
thread|https://lists.apache.org/x/thread.html/594b66ecb1d60b560a5c4c08ed1b2a67bc29143cb4e8d368da8c39b2@%3Cuser.zookeeper.apache.org%3E]
 in the ZK and Curator communities to discuss. Anyway, this concerns 
implementation details only; the interfaces and semantics should not be 
affected.


> Introduce ZooKeeperLeaderElectionServiceNG
> ------------------------------------------
>
>                 Key: FLINK-14149
>                 URL: https://issues.apache.org/jira/browse/FLINK-14149
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>            Reporter: Zili Chen
>            Assignee: Zili Chen
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
