Till Rohrmann created FLINK-25893:
-------------------------------------

             Summary: ResourceManagerServiceImpl's lifecycle can lead to exceptions
                 Key: FLINK-25893
                 URL: https://issues.apache.org/jira/browse/FLINK-25893
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.14.3, 1.15.0
            Reporter: Till Rohrmann


The {{ResourceManagerServiceImpl}} lifecycle can lead to exceptions when 
calling {{ResourceManagerServiceImpl.deregisterApplication}}. The problem 
arises when the {{DispatcherResourceManagerComponent}} is shut down before the 
{{ResourceManagerServiceImpl}} gains leadership or while it is starting the 
{{ResourceManager}}.

One problem is that {{deregisterApplication}} returns an exceptionally 
completed future if there is no leading {{ResourceManager}}.

Another problem is that even if there is a leading {{ResourceManager}}, it may 
not have been started yet. In that case, the call to 
[ResourceManagerGateway.deregisterApplication|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManagerServiceImpl.java#L143]
 will be discarded. The reason for this behaviour is that we create the 
{{ResourceManager}} in one {{Runnable}} and only start it in another, so a 
{{deregisterApplication}} call can acquire the {{lock}} in between the two.
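
To make the two cases concrete, here is a minimal, single-threaded sketch of the three states a {{deregisterApplication}} call can observe. It is not the actual Flink code: {{LifecycleSketch}}, {{FakeService}} and {{FakeResourceManager}} are hypothetical stand-ins, and the discarded call is modelled, as an assumption, by a future that never completes.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical, simplified model of the lifecycle described above; not Flink's code. */
class LifecycleSketch {

    /** Stand-in for a ResourceManager that ignores calls until it has been started. */
    static final class FakeResourceManager {
        private boolean started;

        void start() {
            started = true;
        }

        CompletableFuture<Void> deregisterApplication() {
            if (!started) {
                // Assumption: a discarded call is modelled as a future that never completes.
                return new CompletableFuture<>();
            }
            return CompletableFuture.completedFuture(null);
        }
    }

    /** Stand-in for the service that creates and starts the leader in two separate steps. */
    static final class FakeService {
        private final Object lock = new Object();
        private FakeResourceManager leaderResourceManager; // null while there is no leader

        /** Corresponds to the first Runnable: create the ResourceManager. */
        void createLeaderResourceManager() {
            synchronized (lock) {
                leaderResourceManager = new FakeResourceManager();
            }
        }

        /** Corresponds to the second Runnable: start the ResourceManager. */
        void startLeaderResourceManager() {
            synchronized (lock) {
                leaderResourceManager.start();
            }
        }

        CompletableFuture<Void> deregisterApplication() {
            synchronized (lock) {
                if (leaderResourceManager == null) {
                    // First problem: no leader yields an exceptionally completed future.
                    CompletableFuture<Void> failed = new CompletableFuture<>();
                    failed.completeExceptionally(
                            new IllegalStateException("No leading ResourceManager."));
                    return failed;
                }
                // Second problem: the leader may exist without having been started.
                return leaderResourceManager.deregisterApplication();
            }
        }
    }

    public static void main(String[] args) {
        FakeService service = new FakeService();
        System.out.println("no leader, failed: "
                + service.deregisterApplication().isCompletedExceptionally());
        service.createLeaderResourceManager();
        System.out.println("created only, done: "
                + service.deregisterApplication().isDone());
        service.startLeaderResourceManager();
        System.out.println("started, done: "
                + service.deregisterApplication().isDone());
    }
}
```

In the real service the two steps run as separate {{Runnable}}s, so the call sequence in {{main}} stands for one possible interleaving rather than a guaranteed one.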

I'd suggest correcting the lifecycle and contract of 
{{ResourceManagerServiceImpl.deregisterApplication}}.

Please note that due to this problem, the error reporting of this method has 
been suppressed. See FLINK-25885 for more details.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
