yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-495481836

> About the problem with the SQL test, having the detailed logs from JM/TMs would be helpful. However, if I see correctly, those are batch tests and should not care about changes to checkpointing - so the error might very well be unrelated.

@StefanRRichter I debugged this locally. The problem seems to come from the mechanism for failing the job. `DeduplicateITCase` also triggers a job failure, caused by `CheckpointDeclineTaskNotReadyException`. However, the `ExecutionGraph#failGlobal` method checks that it is running in the main thread by calling `assertRunningInJobMasterMainThread`, and I found that execution cannot get past this check. My guess is that the triggering thread is the Timer in `CheckpointCoordinator`, not the main thread. So we may need to find a new way to fail the job.
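To make the suspected cause concrete, here is a minimal, self-contained sketch in plain Java. It is not Flink code: `Graph`, `runInMainThread`, and `failFromTimerThread` are hypothetical stand-ins, and a single-thread executor plays the role of the Timer thread. It only illustrates the general pattern: a call guarded by a main-thread check is rejected when it arrives from another thread, and goes through when that thread hands the call over to the main-thread executor instead.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MainThreadFailureSketch {

    /** Hypothetical stand-in for a component that may only be touched from one "main" thread. */
    static final class Graph {
        private final ExecutorService mainThreadExecutor = Executors.newSingleThreadExecutor();
        private volatile Thread mainThread;
        private volatile Throwable failureCause;

        Graph() {
            // Remember which thread counts as the main thread.
            runInMainThread(() -> mainThread = Thread.currentThread()).join();
        }

        /** Schedules an action on the main thread. */
        CompletableFuture<Void> runInMainThread(Runnable action) {
            return CompletableFuture.runAsync(action, mainThreadExecutor);
        }

        /** Fails the "job"; only legal from the main thread. */
        void failGlobal(Throwable cause) {
            assertRunningInMainThread();
            failureCause = cause;
        }

        private void assertRunningInMainThread() {
            if (Thread.currentThread() != mainThread) {
                throw new IllegalStateException(
                        "failGlobal called from " + Thread.currentThread().getName());
            }
        }

        boolean hasFailed() {
            return failureCause != null;
        }

        void shutdown() {
            mainThreadExecutor.shutdownNow();
        }
    }

    /**
     * Reports a failure from a separate "timer" thread, either directly or by
     * handing the call over to the main thread. Returns whether the job was failed.
     */
    static boolean failFromTimerThread(boolean handOverToMainThread) {
        Graph graph = new Graph();
        ExecutorService timer = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture.runAsync(() -> {
                RuntimeException cause = new RuntimeException("checkpoint declined");
                if (handOverToMainThread) {
                    graph.runInMainThread(() -> graph.failGlobal(cause)).join();
                } else {
                    graph.failGlobal(cause);
                }
            }, timer).join();
        } catch (CompletionException e) {
            // The thread check rejected the direct call from the timer thread.
        } finally {
            timer.shutdownNow();
            graph.shutdown();
        }
        return graph.hasFailed();
    }

    public static void main(String[] args) {
        System.out.println("direct call from timer thread failed the job: "
                + failFromTimerThread(false));
        System.out.println("call handed over to main thread failed the job: "
                + failFromTimerThread(true));
    }
}
```

In this sketch the direct call never fails the job because the thread check throws first, while the handed-over call does. Whether the same approach fits `ExecutionGraph#failGlobal` depends on what executor the coordinator can reach, which is still open here.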
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services