XComp opened a new pull request #18919:
URL: https://github.com/apache/flink/pull/18919


   ## What is the purpose of the change
   
   This inconsistency was introduced by `c3a6b51` as part of changes done for 
FLINK-19543. It should be, anyway, fixed for consistency reasons.
   
   This issue never appeared because of the following reasons:
   * getAllAndLock is only called by the CompletedCheckpoint
     recovery which happens during the recovery of a job (i.e.
     after failover)
   * the removal happens through releaseAndTryRemove which can
     be called in the following places:
     * through DefaultCompletedCheckpointStore.addCheckpointAndSubsumeOldestOne
       That happens after a job is already recovered and running
     * through DefaultCompletedCheckpointStore.shutdown
       ...where it's only called if the job reached a
       globally-terminal state (i.e. it's not subject to
       job recovery)
     * through DefaultJobGraphStore.globalCleanupAsync
       ...which is also only called on jobs that reached a
       globally-terminal state (i.e. it's not subject to
       job recovery)
     * ZooKeeperStateHandleStore.releaseAndTryRemoveAll
       ...which seems to be legacy code which is not used
       anywhere in production, anymore. I'm leaving it here
       because it might make sense to remove ZooKeeper
       completely, anyway.
   
   ## Brief change log
   
   * Refactored `getAllAndLock` to make it possible to test concurrent node 
changes
   * Fixes the catch statement replacing `KeeperException.NoNodeException` by 
`NotExistException` which was introduced in FLINK-19543
   
   ## Verifying this change
   
   * A test was added to verify the change
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to