[ https://issues.apache.org/jira/browse/FLINK-6612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Rohrmann updated FLINK-6612:
---------------------------------
    Priority: Blocker  (was: Critical)

> ZooKeeperStateHandleStore does not guard against concurrent delete operations
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-6612
>                 URL: https://issues.apache.org/jira/browse/FLINK-6612
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, State Backends, Checkpointing
>    Affects Versions: 1.3.0, 1.4.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Blocker
>             Fix For: 1.3.0, 1.4.0
>
>
> The {{ZooKeeperStateHandleStore}} does not guard against concurrent delete
> operations, which can happen when leadership is lost and then granted to a
> new leader. The problem is that checkpoint nodes can get deleted even
> after they have been recovered by another
> {{ZooKeeperCompletedCheckpointStore}}. This corrupts the recovered checkpoint
> and thwarts future recoveries.
> I propose to add reference counting to the {{ZooKeeperStateHandleStore}}.
> That way, we can monitor how many concurrent processes have a hold on a given
> checkpoint node. Only when the reference count reaches {{0}} are we allowed to
> delete the checkpoint node and dispose of the checkpoint data.
> Stephan proposed to use ephemeral child nodes to track the reference count of
> a checkpoint node. That way we can be sure that locks on a checkpoint node
> are released in case of {{JobManager}} failures.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
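
The reference-counting proposal in the description can be illustrated with a small in-memory model. This is only a sketch of the idea, not the actual {{ZooKeeperStateHandleStore}} change: the class {{RefCountedCheckpointStore}}, its method names, and the owner ids are hypothetical, and a plain map stands in for ZooKeeper. Each owner id plays the role of one ephemeral child node under a checkpoint node, and {{expireSession}} models ZooKeeper dropping those children when a {{JobManager}} session ends.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical in-memory model of reference-counted checkpoint nodes.
 * Assumption: each entry in the owner set stands for one ephemeral
 * lock child that a JobManager session would create in ZooKeeper.
 */
class RefCountedCheckpointStore {

    // checkpoint path -> ids of the owners currently holding a lock on it
    private final Map<String, Set<String>> locksByPath = new HashMap<>();

    /** Creates the checkpoint node if needed and locks it for the owner. */
    synchronized void addAndLock(String path, String ownerId) {
        locksByPath.computeIfAbsent(path, p -> new HashSet<>()).add(ownerId);
    }

    /** Locks an existing node, e.g. during recovery. Returns false if the node is gone. */
    synchronized boolean lock(String path, String ownerId) {
        Set<String> owners = locksByPath.get(path);
        if (owners == null) {
            return false;
        }
        owners.add(ownerId);
        return true;
    }

    /**
     * Releases the owner's lock and deletes the node only if no other
     * owner still holds it. Returns true if the node was deleted.
     */
    synchronized boolean releaseAndTryRemove(String path, String ownerId) {
        Set<String> owners = locksByPath.get(path);
        if (owners == null) {
            return false;
        }
        owners.remove(ownerId);
        if (owners.isEmpty()) {
            locksByPath.remove(path); // reference count reached 0
            return true;
        }
        return false;
    }

    /** Models a session expiry: the owner's locks vanish, the nodes themselves stay. */
    synchronized void expireSession(String ownerId) {
        for (Set<String> owners : locksByPath.values()) {
            owners.remove(ownerId);
        }
    }

    /** Number of owners currently holding a lock on the node (0 if absent). */
    synchronized int referenceCount(String path) {
        Set<String> owners = locksByPath.get(path);
        return owners == null ? 0 : owners.size();
    }

    synchronized boolean exists(String path) {
        return locksByPath.containsKey(path);
    }
}
```

In this model, an old leader that calls {{releaseAndTryRemove}} after a new leader has already locked the same node only drops its own lock; the node survives until the last holder releases it. A real implementation would additionally have to make the "check for remaining lock children, then delete" step atomic against concurrent lockers, which this single-process sketch gets for free from {{synchronized}}.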