XComp commented on pull request #18644: URL: https://github.com/apache/flink/pull/18644#issuecomment-1034821588
I went through the code once more after you raised valid concerns. I reverted my changes and investigated the cancellation code path for the `JobMasterServiceLeadershipRunner`. I noticed one bit which we probably overlooked before: The `SchedulerBase` calls the shutdown on the checkpoint-related resources and expects this operation to succeed (see [SchedulerBase:666](https://github.com/apache/flink/blob/d8a7704a003528f60238ae40f295d0ad696c2780/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java#L666)). Otherwise, it will fail fatally. AFAIU, this would prevent the retry mechanism from kicking in and fail the cluster entirely instead. `AdaptiveScheduler.closeAsync`, in contrast, is not implemented like that but forwards the future as the result of the `closeAsync` operation. I'm wondering whether we should tackle that as a follow-up task outside of the release work.
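
To make the difference concrete, here is a minimal sketch of the two behaviors as I understand them. This is not the actual Flink code: `CloseAsyncSketch`, `closeFailingFatally`, `closeForwardingFailure`, and the `Consumer<Throwable>` standing in for a fatal error handler are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

public class CloseAsyncSketch {

    // Variant A (hypothetical, modeled on the behavior described for SchedulerBase):
    // a failed shutdown is escalated to a fatal error handler, so the caller
    // of closeAsync never sees the failure and cannot retry.
    static CompletableFuture<Void> closeFailingFatally(
            CompletableFuture<Void> shutdownFuture, Consumer<Throwable> fatalErrorHandler) {
        return shutdownFuture.<Void>handle(
                (ignored, error) -> {
                    if (error != null) {
                        fatalErrorHandler.accept(error);
                    }
                    return null;
                });
    }

    // Variant B (hypothetical, modeled on the behavior described for AdaptiveScheduler):
    // the shutdown future is forwarded as the result of closeAsync, so the
    // failure stays visible to the caller, which could then retry.
    static CompletableFuture<Void> closeForwardingFailure(CompletableFuture<Void> shutdownFuture) {
        return shutdownFuture;
    }

    public static void main(String[] args) {
        CompletableFuture<Void> failedShutdown = new CompletableFuture<>();
        failedShutdown.completeExceptionally(
                new IllegalStateException("checkpoint cleanup failed"));

        // Variant A: the error only reaches the fatal error handler.
        closeFailingFatally(
                        failedShutdown,
                        error -> System.out.println("fatal: " + error.getMessage()))
                .join();

        // Variant B: the error reaches the caller, who decides what to do with it.
        closeForwardingFailure(failedShutdown)
                .exceptionally(
                        error -> {
                            System.out.println("retryable: " + error.getMessage());
                            return null;
                        })
                .join();
    }
}
```

With variant B, a retry mechanism sitting on top of `closeAsync` has something to react to. With variant A, it never gets the chance.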