edu05 commented on pull request #16487:
URL: https://github.com/apache/flink/pull/16487#issuecomment-885344162


   Hi @dmvk, while writing the acceptance test I found a couple of things that 
don't quite make sense to me yet. Could you help, please?
   
   1. The new call to the recover method still runs in the JobMaster's main 
thread, not outside of it as intended. You can see this by debugging the new 
IT I added to the PR with a breakpoint inside recover. In the attached 
screenshot, the call to recover is made from SchedulerUtils (as intended), but 
that call is in turn made from inside the JobMaster's main thread.
   
![debug](https://user-images.githubusercontent.com/1392421/126728056-e14f36b4-bd74-4f9e-a4d6-807c98bf6b51.png)
   
   2. Even if the call were made from a separate thread, the first call to 
recover would only have time to "warm up" until the second call to recover 
arrives via CheckpointCoordinator. If the delay between the two calls is 
shorter than the time the first recover takes to execute, the JobMaster will 
still stall at that point and be unable to handle RPC calls.
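
To illustrate both points outside of Flink, here is a minimal sketch in plain Java. It is not the PR's code: `RecoverWarmUpSketch`, `recover`, `threadRunningRecover`, `blockedMillis` and the `io-thread` name are all hypothetical, and it assumes the second call simply waits for the first one to finish, which may not match what CheckpointCoordinator actually does.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RecoverWarmUpSketch {

    // Sleep helper that keeps the interrupt flag instead of throwing.
    static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Hypothetical stand-in for the expensive recover call.
    // Returns the name of the thread that executed it.
    static String recover(long costMillis) {
        sleep(costMillis);
        return Thread.currentThread().getName();
    }

    // Point 1: recover runs on whichever thread invokes it, so it only
    // leaves the caller's thread if it is explicitly handed to an executor.
    static String threadRunningRecover(boolean offload) {
        if (!offload) {
            return recover(0);
        }
        ExecutorService io = Executors.newSingleThreadExecutor(r -> new Thread(r, "io-thread"));
        try {
            return CompletableFuture.supplyAsync(() -> recover(0), io).join();
        } finally {
            io.shutdownNow();
        }
    }

    // Point 2: start the warm-up asynchronously, wait delayMillis, then make
    // the "second call" (assumed to wait for the warm-up to finish).
    // Returns how long the calling thread was blocked, in milliseconds.
    static long blockedMillis(long recoverCostMillis, long delayMillis) {
        ExecutorService io = Executors.newSingleThreadExecutor(r -> new Thread(r, "io-thread"));
        try {
            CompletableFuture<String> warmUp =
                    CompletableFuture.supplyAsync(() -> recover(recoverCostMillis), io);
            sleep(delayMillis);
            long start = System.nanoTime();
            warmUp.join(); // the caller stalls here if the warm-up is still running
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            io.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("direct call ran on:    " + threadRunningRecover(false));
        System.out.println("offloaded call ran on: " + threadRunningRecover(true));
        System.out.println("short delay, blocked ms: " + blockedMillis(400, 0));
        System.out.println("long delay, blocked ms:  " + blockedMillis(50, 300));
    }
}
```

In this sketch the caller blocks for roughly the remaining recover time when the delay is short, and for almost nothing when the delay exceeds the recover time, which is the situation I describe in point 2.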
   
   Does that make sense?
    


