ifndef-SleePy opened a new pull request #9269: [FLINK-9900][tests] Fix unstable ZooKeeperHighAvailabilityITCase
URL: https://github.com/apache/flink/pull/9269

## What is the purpose of the change

* Fix the unstable test `ZooKeeperHighAvailabilityITCase`.`testRestoreBehaviourWithFaultyStateHandles`.
* The test is designed as follows:
  - It assumes that the first 5 checkpoints (1-5) succeed.
  - The job then blocks on the snapshot of checkpoint 6.
  - At this point, the checkpoint files are moved away on purpose.
  - Checkpoint 6 fails due to an expected snapshot failure.
  - The job then fails because of this failed checkpoint.
  - The job cannot recover from checkpoint 5 because its checkpoint files are missing.
  - After the checkpoint files are moved back, the job can recover and continue working.
* However, there is a race between failing the job and triggering another checkpoint.
* If the job is not cancelled fast enough, an unexpected checkpoint 7 may succeed.
* The job can then recover from checkpoint 7 without waiting for the checkpoint files to be moved back.

## Brief change log

* The basic idea of the fix is to prevent the unexpected checkpoint 7.
* Add a latch that blocks snapshots until the HA storage is recovered.

## Verifying this change

* This change is already covered by existing tests.
* The unstable scenario can be reproduced as follows:
  - There is a race between failing the job and triggering another checkpoint.
  - Making the job fail more slowly reproduces the scenario.
  - Modify the `FailJobCallback` of `CheckpointFailureManager` in `ExecutionGraph`.`enableCheckpointing`: change `execute` to `schedule` with a delay.
  - An unexpected checkpoint 7 then succeeds.
  - The test hangs forever: it never fails 5 times, because the job can recover from checkpoint 7.

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): no
  - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
  - The serializers: no
  - The runtime per-record code paths (performance sensitive): no
  - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: no
  - The S3 file system connector: no

## Documentation

  - Does this pull request introduce a new feature? no
  - If yes, how is the feature documented? not applicable
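
For illustration, the latch described under "Brief change log" could look roughly like the sketch below. It is not the code of this pull request: `LatchedSnapshotSketch`, `markHaStorageRecovered`, and `checkpointSevenWaitsForRecovery` are hypothetical names, the JDK's `CountDownLatch` stands in for whatever latch the test actually uses, and the checkpoint numbers are taken from the scenario described above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for the test job's snapshot logic; not Flink code.
class LatchedSnapshotSketch {

    // Released by the test once the checkpoint files have been moved back.
    private final CountDownLatch haStorageRecovered = new CountDownLatch(1);

    // Called by the test after the HA storage is usable again.
    void markHaStorageRecovered() {
        haStorageRecovered.countDown();
    }

    // Checkpoints 1-5 succeed, 6 fails on purpose, later ones wait for the latch.
    long snapshot(long checkpointId) {
        if (checkpointId <= 5) {
            return checkpointId;
        }
        if (checkpointId == 6) {
            throw new IllegalStateException("expected snapshot failure");
        }
        try {
            // Blocks a would-be checkpoint 7 until recovery, which closes the race.
            haStorageRecovered.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return checkpointId;
    }

    // Returns true if checkpoint 7 stayed blocked until the latch was released.
    static boolean checkpointSevenWaitsForRecovery() {
        LatchedSnapshotSketch job = new LatchedSnapshotSketch();
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Callable<Long> task = () -> job.snapshot(7);
            Future<Long> checkpoint7 = executor.submit(task);
            boolean blocked;
            try {
                checkpoint7.get(200, TimeUnit.MILLISECONDS);
                blocked = false; // completed before the storage was recovered
            } catch (TimeoutException expected) {
                blocked = true;
            }
            job.markHaStorageRecovered();
            return blocked && checkpoint7.get() == 7L;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(checkpointSevenWaitsForRecovery());
    }
}
```

With the latch in place, a slow job cancellation can no longer produce a completed checkpoint 7 while the checkpoint files are still missing, so recovery keeps failing until the files are restored.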

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services