[ 
https://issues.apache.org/jira/browse/FLINK-24038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406350#comment-17406350
 ] 

Xintong Song commented on FLINK-24038:
--------------------------------------

I don't think option 2) would work. Deregistering an application can involve 
interactions with the underlying external resource manager. These are usually 
specific to the underlying system and are better performed by the 
ResourceManagerDriver. Most importantly, deregistering an application usually 
means that all of its processes will be terminated, so a non-leader JobManager 
process could kill the leader process if it were allowed to deregister, which 
is undesirable.

Option 1) might work. I would need to look into it a bit more to be sure. Even 
if it works, my gut feeling is that the effort needed and the potential impact 
on stability may not be trivial.

Alternatively, we may consider simply not throwing the error when there is no 
leading resource manager. To be specific, if there is a leading resource 
manager, errors that occur during the deregistration should still be 
considered fatal. But if there is no leading resource manager, we simply skip 
the deregistration. For standalone clusters, there should be no difference 
anyway, since the StandaloneResourceManager does not do anything for 
deregistration. For active resource managers, I think it's a good contract 
that only the leading resource manager interacts with the external resource 
manager (except for read-only operations). The side effect would be that, if 
Flink tries to deregister when there's no leader RM, the deregistration cannot 
succeed and K8s/Yarn will bring up another JobManager process anyway, which is 
the same as the current behavior and IMHO not a big problem.
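
To make the alternative above concrete, here is a minimal sketch of the 
intended contract. It is not the actual Flink code: LeaderGateway, Outcome and 
deregisterIfLeaderPresent are hypothetical names made up for illustration, and 
the real logic would live in or near the DispatcherResourceManagerComponent.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// Sketch only: all names below are hypothetical, not Flink APIs.
final class DeregistrationSketch {

    /** Hypothetical stand-in for the gateway of the leading resource manager. */
    interface LeaderGateway {
        CompletableFuture<Void> deregisterApplication(String finalStatus);
    }

    enum Outcome { DEREGISTERED, SKIPPED_NO_LEADER }

    /**
     * Deregisters only if a leading resource manager is known.
     * Without a leader the call is skipped instead of failing. With a leader,
     * a failed deregistration propagates as an exceptionally completed future,
     * which the caller is expected to treat as fatal.
     */
    static CompletableFuture<Outcome> deregisterIfLeaderPresent(
            Optional<LeaderGateway> leader, String finalStatus) {
        if (!leader.isPresent()) {
            // No leader: do not touch the external resource manager.
            return CompletableFuture.completedFuture(Outcome.SKIPPED_NO_LEADER);
        }
        return leader.get()
                .deregisterApplication(finalStatus)
                .thenApply(ignored -> Outcome.DEREGISTERED);
    }

    public static void main(String[] args) {
        // A leader whose deregistration succeeds immediately.
        LeaderGateway healthyLeader = status -> CompletableFuture.completedFuture(null);

        System.out.println(
                deregisterIfLeaderPresent(Optional.of(healthyLeader), "SUCCEEDED").join());
        System.out.println(
                deregisterIfLeaderPresent(Optional.empty(), "SUCCEEDED").join());
    }
}
```

With this shape, the no-leader case is an ordinary result rather than an 
error, so the caller can keep its fatal-error handling for the leader case 
only.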

> DispatcherResourceManagerComponent fails to deregister application if no 
> leading ResourceManager
> ------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-24038
>                 URL: https://issues.apache.org/jira/browse/FLINK-24038
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.14.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.14.0
>
>
> With FLINK-21667 we introduced a change that can cause the 
> {{DispatcherResourceManagerComponent}} to fail when trying to stop the 
> application. The problem is that the {{DispatcherResourceManagerComponent}} 
> needs a leading {{ResourceManager}} to successfully execute the 
> stop/deregister application call. If this is not the case, then it will fail 
> fatally. In the case of multiple standby JobManager processes it can happen 
> that the leading {{ResourceManager}} runs somewhere else.
> I do see two possible solutions:
> 1. Run the leader election process for the whole JobManager process
> 2. Move the registration/deregistration of the application out of the 
> {{ResourceManager}} so that it can be executed w/o a leader



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
