yanghua commented on issue #8322: [FLINK-12364] Introduce a 
CheckpointFailureManager to centralized manage checkpoint failure
URL: https://github.com/apache/flink/pull/8322#issuecomment-495481836
 
 
   > About the problem with the SQL test, having the detailed logs from JM/TMs 
would be helpful. However, if I see correctly, those are batch tests and should 
not care about changes to checkpointing - so the error might very well be 
unrelated.
   
   @StefanRRichter I debugged this locally. The problem seems to come from 
the mechanism used to fail the job. `DeduplicateITCase` also triggers a job 
failure because of a `CheckpointDeclineTaskNotReadyException`. However, the 
`ExecutionGraph#failGlobal` method checks that it runs in the main thread by 
calling `assertRunningInJobMasterMainThread`, and I found that execution does 
not get past this check. My guess is that the calling thread is the Timer in 
`CheckpointCoordinator`, not the main thread.
   
   So we may need to figure out a new way to fail the job.
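
   To illustrate the suspected threading problem, here is a minimal, 
self-contained sketch. It does not use any Flink classes: `Graph`, 
`assertRunningInMainThread`, and `failFromTimerThread` are hypothetical 
stand-ins, and the two single-thread executors only simulate the main thread 
and the Timer thread. Handing the call over to the main thread is shown as one 
possible direction under these assumptions, not as the fix chosen for this PR.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class MainThreadAssertionSketch {

    /** Hypothetical stand-in for a component whose methods must run on one "main" thread. */
    static final class Graph {
        private final Thread mainThread;
        private volatile String state = "RUNNING";

        Graph(Thread mainThread) {
            this.mainThread = mainThread;
        }

        /** Fails the job, but only if called from the designated main thread. */
        void failGlobal(Throwable cause) {
            assertRunningInMainThread();
            state = "FAILED";
        }

        private void assertRunningInMainThread() {
            if (Thread.currentThread() != mainThread) {
                throw new IllegalStateException(
                        "Not running in main thread: " + Thread.currentThread().getName());
            }
        }

        String state() {
            return state;
        }
    }

    /**
     * Simulates a timer thread that tries to fail the job.
     * Returns the final state, or "REJECTED: ..." if the thread check refused the call.
     */
    static String failFromTimerThread(boolean handOffToMainThread) {
        ExecutorService mainExecutor = Executors.newSingleThreadExecutor();
        ExecutorService timer = Executors.newSingleThreadExecutor();
        try {
            // Capture the executor's single worker thread as the "main" thread.
            Callable<Thread> whoAmI = Thread::currentThread;
            Graph graph = new Graph(mainExecutor.submit(whoAmI).get());

            Exception cause = new Exception("checkpoint declined");
            Callable<Void> failJob = () -> {
                graph.failGlobal(cause);
                return null;
            };

            // Either call failGlobal directly on the timer thread,
            // or let the timer thread hand the call over to the main thread.
            Callable<Void> timerTask;
            if (handOffToMainThread) {
                timerTask = () -> mainExecutor.submit(failJob).get();
            } else {
                timerTask = failJob;
            }

            try {
                timer.submit(timerTask).get();
            } catch (ExecutionException e) {
                return "REJECTED: " + e.getCause().getClass().getSimpleName();
            }
            return graph.state();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Sketch setup failed", e);
        } finally {
            mainExecutor.shutdownNow();
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(failFromTimerThread(false)); // REJECTED: IllegalStateException
        System.out.println(failFromTimerThread(true));  // FAILED
    }
}
```

   In this sketch the direct call from the timer thread is rejected by the 
thread check, while the same call submitted to the main-thread executor 
completes and moves the state to `FAILED`.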

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
