dmvk commented on pull request #16487: URL: https://github.com/apache/flink/pull/16487#issuecomment-885454626
Hi @edu05,

From the attached screenshot you can see that JobMaster initialization happens in "jobmanager-future-thread-1", before it confirms leadership. This is not a concern, because until leadership is confirmed, we just fail any RPC with ...

```
org.apache.flink.runtime.dispatcher.UnavailableDispatcherOperationException: Unable to get JobMasterGateway for initializing job. The requested operation is not available while the JobManager is initializing.
```

The actual problem arises when there is a real recovery that happens over RPC (e.g. a task failure). Every JobMaster is tied to a single RPC thread (this is what is referred to as the mainThread in JobMaster's code, which may be a little confusing), and if this thread is blocked, it cannot respond to any other RPC.

> Even if the call was made from a separate thread, the first call to recover would only "warm up" for the period of time before the second call to recover via CheckpointCoordinator. If the delay between both calls is shorter than the time it takes for the first recover to execute, the JobMaster will become stalled at that point and unable to take RPC calls.

The call to `recover()` is blocking, so JobMaster initialization won't happen until this initial "warmup" call to `recover()` finishes. We confirm leadership after this. See `JobMasterServiceLeadershipRunner#createNewJobMasterServiceProcess` for more details.

---

The new `CheckpointStoreRPCITCase` you've introduced does not stress the "RPC" recovery code path, unlike the example I've sent you. Also, it would be nice to make the new test part of the already existing `CheckpointStoreITCase` instead of introducing a new test class.

---

Since the 1.14 release is getting closer, and there are still a lot of tasks to be done, I'd timebox this effort until Monday 26th (inclusive). If we're not able to make this work by then, I'd take this over, so we can move on to the next task.
This task is really complex and requires a lot of context, so I hope you won't have any hard feelings if we do this.

---

Good job so far! ;)
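
To make the main-thread point more concrete, here is a minimal sketch in plain Java. It is not Flink code: the single-threaded executor only stands in for the JobMaster's RPC main thread, and `MainThreadBlockingSketch`, `secondCallStalls` and the "ack" reply are hypothetical names made up for the illustration. A blocking task (standing in for a blocking `recover()`) occupies the only thread, so a second call queued behind it cannot be answered before its timeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class MainThreadBlockingSketch {

    /**
     * Returns true if a second call times out while a blocking call
     * occupies the single "main thread" (hypothetical stand-in, not Flink API).
     */
    static boolean secondCallStalls(long rpcTimeoutMillis) throws Exception {
        // One thread only, like the single RPC thread a JobMaster is tied to.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        CountDownLatch recoveryStarted = new CountDownLatch(1);
        CountDownLatch recoveryDone = new CountDownLatch(1);
        try {
            // Stand-in for a blocking recover() executed on the main thread.
            mainThread.execute(() -> {
                recoveryStarted.countDown();
                try {
                    recoveryDone.await(); // blocks until "recovery" is released
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            recoveryStarted.await();

            // A second call is queued behind the blocking one.
            CompletableFuture<String> reply =
                    CompletableFuture.supplyAsync(() -> "ack", mainThread);
            try {
                reply.get(rpcTimeoutMillis, TimeUnit.MILLISECONDS);
                return false; // answered in time, main thread was free
            } catch (TimeoutException e) {
                return true; // main thread was busy, the call stalled
            }
        } finally {
            recoveryDone.countDown(); // release the blocking task
            mainThread.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("second call stalled: " + secondCallStalls(200));
    }
}
```

In this sketch the second call always times out, because the blocking task is released only after the timeout has been observed. A test that wants to stress the RPC recovery path would need the same shape: the recovery has to run on the main thread while other calls are arriving.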
