[ 
https://issues.apache.org/jira/browse/FLINK-21980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309512#comment-17309512
 ] 

Ricky Burnett commented on FLINK-21980:
---------------------------------------

I think the best solution would be to check whether the node exists before 
setting its data. If it doesn't, ensure only that runningJobPath exists, then 
create the node with its data.
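
A minimal sketch of that check-then-create flow, for illustration only. To 
keep it self-contained it runs against a hypothetical in-memory stand-in 
(InMemoryZk) rather than the real Curator client, and the names writeStatus, 
ensurePath, create and setData are assumptions of this sketch, not Flink or 
Curator signatures.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for ZooKeeper; NOT the Curator API. */
final class InMemoryZk {
    private final Map<String, byte[]> nodes = new ConcurrentHashMap<>();

    boolean exists(String path) {
        return nodes.containsKey(path);
    }

    /** Creates the path as an empty node if it is missing. */
    void ensurePath(String path) {
        nodes.putIfAbsent(path, new byte[0]);
    }

    /** Creates a node together with its data; fails if the node already exists. */
    void create(String path, byte[] data) {
        if (nodes.putIfAbsent(path, data.clone()) != null) {
            throw new IllegalStateException("node already exists: " + path);
        }
    }

    /** Overwrites the data of an existing node; fails if the node is missing. */
    void setData(String path, byte[] data) {
        if (nodes.replace(path, data.clone()) == null) {
            throw new IllegalStateException("no such node: " + path);
        }
    }

    /** Returns the node's data, or null if the node does not exist. */
    byte[] getData(String path) {
        return nodes.get(path);
    }
}

final class RegistryWriteSketch {
    /** Writes the status without ever leaving an empty job node behind. */
    static void writeStatus(InMemoryZk zk, String runningJobPath, String jobId, String status) {
        String zkPath = runningJobPath + "/" + jobId;
        byte[] data = status.getBytes(StandardCharsets.UTF_8);
        if (zk.exists(zkPath)) {
            // Node is already there: just update its data.
            zk.setData(zkPath, data);
        } else {
            // Ensure only the parent path, then create the job node with its data
            // in a single step, so an interruption cannot leave it empty.
            zk.ensurePath(runningJobPath);
            zk.create(zkPath, data);
        }
    }
}
```

Against a real ZooKeeper the exists check and the create are not atomic, so 
an actual fix would presumably also need to handle the case where the node 
appears between the two calls.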

Alternatively, given the lifecycle of the node, it seems reasonable to treat 
an empty node as PENDING instead of throwing an exception.
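
A rough sketch of that alternative, assuming the status is stored as the 
enum name encoded in UTF-8. SchedulingStatus and parseStatus are 
hypothetical names for this illustration; the enum only mirrors the PENDING, 
RUNNING and DONE states and is not Flink's own class.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical local enum mirroring the registry's status names. */
enum SchedulingStatus { PENDING, RUNNING, DONE }

final class RegistryReadSketch {
    /**
     * Maps znode data to a status. A missing or empty payload is read as
     * PENDING, on the assumption that the node was created but its status
     * was never written.
     */
    static SchedulingStatus parseStatus(byte[] data) {
        if (data == null || data.length == 0) {
            return SchedulingStatus.PENDING;
        }
        String name = new String(data, StandardCharsets.UTF_8);
        // valueOf throws IllegalArgumentException for an unrecognized name,
        // so genuinely corrupt data still surfaces as an error.
        return SchedulingStatus.valueOf(name);
    }
}
```

With this reading, the empty znode shown in the report below would resolve to 
PENDING on recovery instead of failing the restore.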

> ZooKeeperRunningJobsRegistry creates an empty znode
> ---------------------------------------------------
>
>                 Key: FLINK-21980
>                 URL: https://issues.apache.org/jira/browse/FLINK-21980
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.3, 1.10.3, 1.11.3, 1.12.2
>            Reporter: Ricky Burnett
>            Priority: Critical
>             Fix For: 1.12.3
>
>
> ZooKeeperRunningJobsRegistry#writeEnumToZooKeeper calls
> {code:java}
> this.client.newNamespaceAwareEnsurePath(zkPath).ensure(client.getZookeeperClient());{code}
> This creates an empty znode in ZooKeeper.  If the job manager is interrupted 
> at this point, it cannot recover.  When trying to restore jobs on 
> a restarted job manager, ZooKeeperRunningJobsRegistry#getJobSchedulingStatus 
> throws an exception due to the empty znode. 
> This behavior was verified in a test environment where the job manager was 
> interrupted at that point in execution, leaving ZK in the following state:
> {code:java}
> [zk: localhost:2181(CONNECTED) 2] ls /flink/default
> [checkpoint-counter, checkpoints, jobgraphs, leader, leaderlatch, running_job_registry]
> [zk: localhost:2181(CONNECTED) 3] ls /flink/default/running_job_registry
> [c982053dd0b9100967e6a9d89202f2a5]
> [zk: localhost:2181(CONNECTED) 4] get /flink/default/running_job_registry/c982053dd0b9100967e6a9d89202f2a5
> [zk: localhost:2181(CONNECTED) 5]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)