[jira] [Updated] (FLINK-14685) ZooKeeperCheckpointIDCounter forever broken if once loss connection with ZK

Zili Chen (Jira) Sat, 09 Nov 2019 05:11:08 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Zili Chen updated FLINK-14685:
------------------------------
    Description: 
Currently, if {{ZooKeeperCheckpointIDCounter}} sees the SUSPENDED state, i.e. 
a connection loss, it marks its state as invalid, so that all subsequent 
checkpoint ID counter operations fail.

Coupled with JM leadership management, a new ID counter is generated when 
leadership is re-granted, so this has not been a problem so far. The semantics 
are still wrong, because the ID counter should only check whether the current 
state is SUSPENDED/LOST. 
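
A minimal sketch of the intended semantics, assuming nothing about the actual patch: the counter remembers only the most recent connection state and rejects operations only while that state is SUSPENDED or LOST, so it recovers after RECONNECTED. The names {{ConnState}} and {{StateCheckingIdCounter}} are hypothetical, {{ConnState}} merely stands in for Curator's connection state, and an in-memory {{AtomicLong}} replaces the ZooKeeper-backed count.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for Curator's connection states; not the real enum.
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

// Hypothetical sketch; this is not Flink's ZooKeeperCheckpointIDCounter.
final class StateCheckingIdCounter {
    // In-memory stand-in for the ZooKeeper-backed shared count (assumption).
    private final AtomicLong nextId = new AtomicLong(0);

    // Only the most recent state is remembered; nothing is latched as invalid.
    private final AtomicReference<ConnState> lastState =
            new AtomicReference<>(ConnState.CONNECTED);

    // Would be invoked by a connection state listener on every state change.
    void stateChanged(ConnState newState) {
        lastState.set(newState);
    }

    // Fails only while the current state is SUSPENDED or LOST.
    long getAndIncrement() {
        ConnState state = lastState.get();
        if (state == ConnState.SUSPENDED || state == ConnState.LOST) {
            throw new IllegalStateException("Connection state: " + state);
        }
        return nextId.getAndIncrement();
    }
}
```

With this shape, a call made while SUSPENDED throws, and a call made after a later RECONNECTED succeeds again, instead of the counter staying broken for the rest of its lifetime.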

It also blocks upgrading to Curator 4.2 and tolerating the SUSPENDED state in 
{{LeaderLatch}}. [~lamber-ken] provides a 
[fix|https://github.com/BigDataArtisans/flink/commit/bd146ddcd1d9e0501f7e792875f5887edb8b7299]
 there.

Besides, in production we once noticed that a JM was not re-elected on a very 
fast SUSPENDED-RECONNECTED transition (this should no longer happen after 
[~trohrmann] added linearized leader operations), so the JM kept running with 
a broken ID counter.

I think it is reasonable to pick [~lamber-ken]'s commit as a separate issue 
and fix these wrong semantics.

CC [~GJL] [~azagrebin]

  was:
Currently, if {{ZooKeeperCheckpointIDCounter}} sees the SUSPENDED state, i.e. 
a connection loss, it marks its state as invalid, so that all subsequent 
checkpoint ID counter operations fail.

Coupled with JM leadership management, a new ID counter is generated when 
leadership is re-granted, so this has not been a problem so far. The semantics 
are still wrong, because the ID counter should only check whether the current 
state is SUSPENDED/LOST. 

It also blocks upgrading to Curator 4.2, and [~lamber-ken] provides a 
[fix|https://github.com/BigDataArtisans/flink/commit/bd146ddcd1d9e0501f7e792875f5887edb8b7299]
 there.

Besides, in production we once noticed that a JM was not re-elected on a very 
fast SUSPENDED-RECONNECTED transition (this should no longer happen after 
[~trohrmann] added linearized leader operations), so the JM kept running with 
a broken ID counter.

I think it is reasonable to pick [~lamber-ken]'s commit as a separate issue 
and fix these wrong semantics.

CC [~GJL] [~azagrebin]


> ZooKeeperCheckpointIDCounter forever broken if once loss connection with ZK
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-14685
>                 URL: https://issues.apache.org/jira/browse/FLINK-14685
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Zili Chen
>            Priority: Major
>             Fix For: 1.10.0
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
