[ https://issues.apache.org/jira/browse/FLINK-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zili Chen updated FLINK-14685:
------------------------------
Description:
Currently, if {{ZooKeeperCheckpointIDCounter}} enters the SUSPENDED state, i.e.
loses its connection, it marks its state as invalid, so that all subsequent
checkpoint ID counter operations fail.
Coupled with JM leadership management, we generate a new ID counter when
leadership is re-granted, so this has not been a problem so far. The semantics
are wrong nonetheless, because the ID counter should only check whether the
current state is SUSPENDED/LOST.
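A minimal sketch of the intended semantics: the counter records the latest
connection state and rejects an operation only while that state is SUSPENDED or
LOST, so it recovers after a reconnect. {{ConnState}} is a local stand-in for
Curator's connection states, and {{SketchCheckpointIdCounter}} is a hypothetical
in-memory class, not the actual {{ZooKeeperCheckpointIDCounter}} or the linked
fix.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Local stand-in for Curator's connection states (assumption, so the sketch compiles alone).
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

// Hypothetical in-memory counter that illustrates the state check only.
class SketchCheckpointIdCounter {
    // Latest connection state reported by the listener.
    private final AtomicReference<ConnState> lastState =
            new AtomicReference<>(ConnState.CONNECTED);
    private final AtomicLong counter = new AtomicLong(1);

    // Called on every connection state change; it overwrites the previous
    // state instead of latching a permanent "invalid" flag.
    void stateChanged(ConnState newState) {
        lastState.set(newState);
    }

    // Fails only while the connection is currently SUSPENDED or LOST.
    long getAndIncrement() {
        ConnState state = lastState.get();
        if (state == ConnState.SUSPENDED || state == ConnState.LOST) {
            throw new IllegalStateException("Connection state: " + state);
        }
        return counter.getAndIncrement();
    }
}
```

With this shape, a SUSPENDED-RECONNECTED sequence makes operations fail only in
between the two events; after RECONNECTED the counter is usable again.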
It is also a blocker for upgrading to Curator 4.2 and for tolerating the
SUSPENDED state in {{LeaderLatch}}. [~lamber-ken] provides a
[fix|https://github.com/BigDataArtisans/flink/commit/bd146ddcd1d9e0501f7e792875f5887edb8b7299]
there.
Besides, in a production scenario we once noticed that the JM was not
re-elected on a very fast SUSPENDED-RECONNECTED transition (this should no
longer happen since [~trohrmann] added the linearized leader operation), so
the JM kept running with a broken ID counter.
I think it is reasonable to pick [~lamber-ken]'s commit as a separate issue
and fix these wrong semantics.
CC [~GJL] [~azagrebin]
was:
Currently, if {{ZooKeeperCheckpointIDCounter}} suffers SUSPENDED state i.e.
connection loss, it will set the state as invalid so that all checkpoint id
counter operations succeed will fail.
Although couple with JM leadership management we will generate a new id counter
on re-granted leadership so that it is not a problem so far, the semantic is
wrong because id counter should only check whether current state is
SUSPENDED/LOST.
It is also a blocker upgrading to Curator 4.2 and [~lamber-ken] provides a
[fix|https://github.com/BigDataArtisans/flink/commit/bd146ddcd1d9e0501f7e792875f5887edb8b7299]
there.
Besides, in product scenario we once noticed that JM didn't re-elected(it
shouldn't happen after [~trohrmann] add linearized leader operation) on
SUSPENDED-RECONNECTED very fast so that a JM runs with a broken ID counter.
I think it is reasonable we pick [~lamber-ken]'s commit as a separated issue
and fix this wrong semantic.
CC [~GJL] [~azagrebin]
> ZooKeeperCheckpointIDCounter forever broken if once loss connection with ZK
> ---------------------------------------------------------------------------
>
> Key: FLINK-14685
> URL: https://issues.apache.org/jira/browse/FLINK-14685
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Runtime / Coordination
> Affects Versions: 1.10.0
> Reporter: Zili Chen
> Priority: Major
> Fix For: 1.10.0