Till Rohrmann created FLINK-10333:
-------------------------------------

             Summary: Rethink ZooKeeper based stores (SubmittedJobGraph, MesosWorker, CompletedCheckpoints)
                 Key: FLINK-10333
                 URL: https://issues.apache.org/jira/browse/FLINK-10333
             Project: Flink
          Issue Type: Bug
          Components: Distributed Coordination
    Affects Versions: 1.6.0, 1.5.3, 1.7.0
            Reporter: Till Rohrmann
             Fix For: 1.7.0

While going over the ZooKeeper based stores ({{ZooKeeperSubmittedJobGraphStore}}, {{ZooKeeperMesosWorkerStore}}, {{ZooKeeperCompletedCheckpointStore}}) and the underlying {{ZooKeeperStateHandleStore}}, I noticed several inconsistencies which were introduced by past incremental changes.

* Depending on whether {{ZooKeeperStateHandleStore#getAllSortedByNameAndLock}} or {{ZooKeeperStateHandleStore#getAllAndLock}} is called, deserialization problems either lead to the Znode being removed or not
* {{ZooKeeperStateHandleStore}} leaves inconsistent state behind in case of exceptions (e.g. {{#getAllAndLock}} won't release the acquired locks in case of a failure)
* {{ZooKeeperStateHandleStore}} has too many responsibilities. It would be better to move {{RetrievableStateStorageHelper}} out of it for a better separation of concerns
* {{ZooKeeperSubmittedJobGraphStore}} overwrites a stored {{JobGraph}} even if it is locked. This should not happen since it could leave another system in an inconsistent state (imagine a changed {{JobGraph}} which restores from an old checkpoint)
* The put logic is duplicated across the different stores, and the copies are also somewhat inconsistent with each other

These problems made me wonder how reliably these components actually work. Since these components are very important, I propose to refactor them.
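
To make the second point above concrete, here is a minimal sketch of what "release the acquired locks in case of a failure" could look like. It is not Flink's code and does not talk to ZooKeeper: {{LockingStoreSketch}}, {{lockedNames}} and {{deserialize}} are hypothetical names, and an in-memory map and set stand in for the Znodes and their locks. The only thing it is meant to show is that a failing bulk read gives back every lock it took itself.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for a lock-protected handle store (not Flink's API). */
final class LockingStoreSketch {
    // Entry name -> serialized payload; a TreeMap keeps iteration sorted by name.
    private final Map<String, String> entries = new TreeMap<>();
    // Names of entries that are currently locked.
    private final Set<String> locked = new HashSet<>();

    void put(String name, String payload) {
        entries.put(name, payload);
    }

    /** Returns a copy of the currently locked entry names. */
    Set<String> lockedNames() {
        return new HashSet<>(locked);
    }

    /**
     * Locks and reads every entry. If reading any entry fails, all locks
     * acquired by this call are released before the exception propagates.
     */
    List<String> getAllAndLock() {
        List<String> acquiredHere = new ArrayList<>();
        List<String> result = new ArrayList<>();
        try {
            for (Map.Entry<String, String> entry : entries.entrySet()) {
                // Remember only locks newly taken by this call.
                if (locked.add(entry.getKey())) {
                    acquiredHere.add(entry.getKey());
                }
                result.add(deserialize(entry.getValue()));
            }
            return result;
        } catch (RuntimeException e) {
            // Roll back so a failed call leaves no locks behind.
            locked.removeAll(acquiredHere);
            throw e;
        }
    }

    /** Assumption for illustration: payloads starting with "corrupt" cannot be read. */
    private static String deserialize(String payload) {
        if (payload.startsWith("corrupt")) {
            throw new IllegalStateException("cannot deserialize: " + payload);
        }
        return payload.toUpperCase();
    }

    public static void main(String[] args) {
        LockingStoreSketch store = new LockingStoreSketch();
        store.put("job-1", "graph");
        store.put("job-2", "corrupt-bytes");
        try {
            store.getAllAndLock();
        } catch (IllegalStateException e) {
            System.out.println("read failed, locks still held: " + store.lockedNames());
        }
    }
}
```

Whether a corrupt entry should additionally be removed is a separate decision (the first point above); the sketch deliberately leaves the entry in place and only deals with the locks.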