Till Rohrmann created FLINK-10333:
-------------------------------------

             Summary: Rethink ZooKeeper based stores (SubmittedJobGraph, 
MesosWorker, CompletedCheckpoints)
                 Key: FLINK-10333
                 URL: https://issues.apache.org/jira/browse/FLINK-10333
             Project: Flink
          Issue Type: Bug
          Components: Distributed Coordination
    Affects Versions: 1.6.0, 1.5.3, 1.7.0
            Reporter: Till Rohrmann
             Fix For: 1.7.0


While going over the ZooKeeper-based stores 
({{ZooKeeperSubmittedJobGraphStore}}, {{ZooKeeperMesosWorkerStore}}, 
{{ZooKeeperCompletedCheckpointStore}}) and the underlying 
{{ZooKeeperStateHandleStore}}, I noticed several inconsistencies that were 
introduced by past incremental changes.

* Depending on whether {{ZooKeeperStateHandleStore#getAllSortedByNameAndLock}} or 
{{ZooKeeperStateHandleStore#getAllAndLock}} is called, a deserialization problem 
either leads to the znode being removed or it does not
* {{ZooKeeperStateHandleStore}} leaves inconsistent state behind when exceptions 
occur (e.g. {{#getAllAndLock}} won't release the locks it has already acquired 
if it fails)
* {{ZooKeeperStateHandleStore}} has too many responsibilities. It would be 
better to move {{RetrievableStateStorageHelper}} out of it for a better 
separation of concerns
* {{ZooKeeperSubmittedJobGraphStore}} overwrites a stored {{JobGraph}} even if 
it is locked. This should not happen since it could leave another system in an 
inconsistent state (imagine a changed {{JobGraph}} which restores from an old 
checkpoint)
* The put logic is duplicated across the different stores and is also somewhat 
inconsistent between them
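
To illustrate the lock-release point above, here is a minimal sketch of the 
behaviour I would expect from a {{getAllAndLock}}-style call: if reading any 
entry fails, the locks taken by that call are rolled back. This is not Flink or 
Curator code. {{LockingStoreSketch}}, its {{Deserializer}} interface and the 
in-memory maps are hypothetical stand-ins for the znodes and lock nodes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for a lock-protected handle store; not Flink code. */
final class LockingStoreSketch {

    /** Hypothetical deserializer; may throw for corrupted entries. */
    interface Deserializer<T> {
        T deserialize(byte[] bytes) throws Exception;
    }

    // Entry name -> serialized payload (stands in for the znodes).
    private final Map<String, byte[]> entries = new LinkedHashMap<>();
    // Names currently locked (stands in for the lock nodes).
    private final Set<String> locked = new HashSet<>();

    void put(String name, byte[] bytes) {
        entries.put(name, bytes);
    }

    /** Returns a copy of the currently locked names. */
    Set<String> lockedNames() {
        return new HashSet<>(locked);
    }

    /**
     * Locks and reads every entry. If any entry fails to deserialize,
     * the locks acquired by this call are released before the exception
     * propagates, so no partial lock state is left behind.
     */
    <T> List<T> getAllAndLock(Deserializer<T> deserializer) throws Exception {
        List<String> acquired = new ArrayList<>();
        List<T> result = new ArrayList<>();
        boolean success = false;
        try {
            for (Map.Entry<String, byte[]> entry : entries.entrySet()) {
                // Remember only the locks that this call newly acquired.
                if (locked.add(entry.getKey())) {
                    acquired.add(entry.getKey());
                }
                result.add(deserializer.deserialize(entry.getValue()));
            }
            success = true;
            return result;
        } finally {
            if (!success) {
                // Roll back: release exactly the locks taken above.
                locked.removeAll(acquired);
            }
        }
    }
}
```

Tracking the acquired names separately means a failed call does not release 
locks that were already held before it started.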

These problems made me question how reliably these components actually work. 
Since these components are very important, I propose to refactor them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
