[ https://issues.apache.org/jira/browse/FLINK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406350#comment-17406350 ]
Xintong Song commented on FLINK-24038:
--------------------------------------

I think option 2) would not work. Deregistering an application can involve interactions with the underlying external resource manager. These interactions are usually specific to the underlying system and are better performed by the ResourceManagerDriver. Most importantly, deregistering an application usually means that all of its processes will be terminated. A non-leader JobManager process could therefore kill a leader process if it were allowed to deregister, which is undesirable.

Option 1) might work. I would need to look into it a bit more to be sure. Even if it works, my gut feeling is that the effort needed and the potential impact on stability may not be trivial.

Alternatively, we may consider simply not throwing the error when there is no leading resource manager. To be specific, if there is a leading resource manager, errors that occur during the deregistration should still be considered fatal. But if there is no leading resource manager, we simply skip the deregistration.

For standalone clusters, there should be no difference anyway, since the StandaloneResourceManager does not do anything for deregistration.

For active resource managers, I think it is a good contract that only the leading resource manager interacts with the external resource manager (except for purely read-only operations). The side effect would be that, if Flink tries to deregister when there is no leader RM, the deregistration cannot succeed and K8s/Yarn will bring up another JobManager process anyway. This is the same as the current behavior and IMHO not a big problem.
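The alternative described above (skip deregistration when there is no leading resource manager, but keep failures fatal when there is one) could be sketched roughly as follows. This is only an illustration of the proposed contract, not Flink's actual code: `DeregistrationSketch`, `LeaderGateway`, `Outcome`, and `deregisterIfLeaderPresent` are hypothetical names invented for this sketch, and the real `DispatcherResourceManagerComponent` retrieves the leader differently.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

final class DeregistrationSketch {

    /** Hypothetical stand-in for the gateway of a leading ResourceManager. */
    interface LeaderGateway {
        CompletableFuture<Void> deregisterApplication(String finalStatus);
    }

    /** Hypothetical result type, used only to make the two paths visible. */
    enum Outcome { DEREGISTERED, SKIPPED_NO_LEADER }

    /**
     * Deregisters only if a leading resource manager is known.
     * If the leader's deregistration fails, the returned future completes
     * exceptionally, so the caller can still treat that failure as fatal.
     */
    static CompletableFuture<Outcome> deregisterIfLeaderPresent(
            Optional<LeaderGateway> leader, String finalStatus) {
        if (!leader.isPresent()) {
            // No leader: do not touch the external resource manager at all.
            return CompletableFuture.completedFuture(Outcome.SKIPPED_NO_LEADER);
        }
        return leader.get()
                .deregisterApplication(finalStatus)
                .thenApply(ignored -> Outcome.DEREGISTERED);
    }
}
```

With this shape, a process without a leading resource manager never talks to the external system, which matches the contract that only the leader interacts with K8s/Yarn.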
> DispatcherResourceManagerComponent fails to deregister application if no leading ResourceManager
> ------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-24038
>                 URL: https://issues.apache.org/jira/browse/FLINK-24038
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.14.0
>
>
> With FLINK-21667 we introduced a change that can cause the
> {{DispatcherResourceManagerComponent}} to fail when trying to stop the
> application. The problem is that the {{DispatcherResourceManagerComponent}}
> needs a leading {{ResourceManager}} to successfully execute the
> stop/deregister application call. If this is not the case, then it will fail
> fatally. In the case of multiple standby JobManager processes it can happen
> that the leading {{ResourceManager}} runs somewhere else.
> I do see two possible solutions:
> 1. Run the leader election process for the whole JobManager process
> 2. Move the registration/deregistration of the application out of the
> {{ResourceManager}} so that it can be executed w/o a leader

--
This message was sent by Atlassian Jira
(v8.3.4#803005)