XComp commented on pull request #18644:
URL: https://github.com/apache/flink/pull/18644#issuecomment-1034821588


   I went through the code once more after you raised valid concerns. I 
reverted my changes and investigated the cancellation code path for the 
`JobMasterServiceLeadershipRunner`. I noticed one thing that we probably 
overlooked before: the `SchedulerBase` triggers the shutdown of the 
checkpoint-related resources and expects this operation to succeed (see 
[SchedulerBase:666](https://github.com/apache/flink/blob/d8a7704a003528f60238ae40f295d0ad696c2780/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java#L666)).
 Otherwise, it fails fatally. AFAIU, this would prevent the retry mechanism 
from kicking in and would fail the entire cluster instead.
   
   `AdaptiveScheduler.closeAsync` is not implemented like that. Instead, it 
forwards the future as the result of the `closeAsync` operation. I'm wondering 
whether we should tackle that as a follow-up task outside of the release work.
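
   To make the difference between the two behaviours concrete, here is a 
minimal sketch using plain `CompletableFuture`s. It is not Flink's actual 
code: the class `CloseAsyncSketch`, the methods `closeExpectingSuccess` and 
`closeForwarding`, and the `fatalError` holder are hypothetical names chosen 
for illustration, and the "fatal error handler" only records the error instead 
of terminating the process.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch only; all names here are hypothetical, not Flink APIs.
class CloseAsyncSketch {

    // Stand-in for a fatal error handler: records the error instead of exiting.
    static final AtomicReference<Throwable> fatalError = new AtomicReference<>();

    // Variant A: the cleanup is expected to succeed. A failure is escalated to
    // the fatal error handler, and the caller never sees a failed future that
    // it could react to (e.g. by retrying).
    static CompletableFuture<Void> closeExpectingSuccess(CompletableFuture<Void> cleanup) {
        return cleanup.<Void>handle(
                (ignored, error) -> {
                    if (error != null) {
                        fatalError.set(error);
                    }
                    return null;
                });
    }

    // Variant B: the cleanup future is forwarded as the close result, so a
    // failure stays visible to the caller.
    static CompletableFuture<Void> closeForwarding(CompletableFuture<Void> cleanup) {
        return cleanup;
    }

    public static void main(String[] args) {
        CompletableFuture<Void> failedCleanup = new CompletableFuture<>();
        failedCleanup.completeExceptionally(new IllegalStateException("cleanup failed"));

        CompletableFuture<Void> a = closeExpectingSuccess(failedCleanup);
        CompletableFuture<Void> b = closeForwarding(failedCleanup);

        System.out.println("A failed for caller: " + a.isCompletedExceptionally());
        System.out.println("A fatal error recorded: " + (fatalError.get() != null));
        System.out.println("B failed for caller: " + b.isCompletedExceptionally());
    }
}
```

   In variant B the caller can observe the failure and decide how to handle 
it; in variant A the decision has already been made by the time the returned 
future completes.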


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

