zhijiang created FLINK-5893:
-------------------------------

             Summary: Race condition in removing previous JobManagerRegistration in ResourceManager
                 Key: FLINK-5893
                 URL: https://issues.apache.org/jira/browse/FLINK-5893
             Project: Flink
          Issue Type: Bug
          Components: ResourceManager
            Reporter: zhijiang


The map of {{JobManagerRegistration}} in {{ResourceManager}} is not 
thread-safe, and two threads may currently operate on it concurrently, 
which can produce unexpected results.

The scenario is as follows:

{{registerJobManager}}: When the job leader changes and the new JobManager 
leader registers with the ResourceManager, the new {{JobManagerRegistration}} 
replaces the old one in the map under the same {{JobID}} key. This is 
triggered by the rpc thread.

Meanwhile, the {{JobLeaderIdService}} in the ResourceManager may notice the 
job leader change and trigger {{jobLeaderLostLeadership}} in another thread. 
This action removes the previous {{JobManagerRegistration}} from the map by 
{{JobID}}, but by then the old {{JobManagerRegistration}} may already have 
been replaced by the new one from {{registerJobManager}}.

In summary, this race condition may cause the new {{JobManagerRegistration}} 
to be removed from the ResourceManager, resulting in an exception when a slot 
is requested from the ResourceManager.
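
The problematic interleaving can be replayed deterministically on a single 
thread. The following is only a minimal sketch: the {{Registration}} class, the 
String job id and the {{replayInterleaving}} helper are hypothetical stand-ins, 
not the actual ResourceManager code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class RegistrationRaceSketch {

    // Hypothetical stand-in for JobManagerRegistration; only the leader id matters here.
    static final class Registration {
        final UUID leaderId;

        Registration(UUID leaderId) {
            this.leaderId = leaderId;
        }
    }

    // Hypothetical stand-in for the registration map; a String plays the role of JobID.
    static final Map<String, Registration> registrations = new HashMap<>();

    // A put under the same key replaces whatever registration was there before.
    static void registerJobManager(String jobId, UUID leaderId) {
        registrations.put(jobId, new Registration(leaderId));
    }

    // Removal by key only: it cannot tell the old registration from the new one.
    static void jobLeaderLostLeadership(String jobId) {
        registrations.remove(jobId);
    }

    // Replays the ordering from the report: the new leader registers first,
    // then the "lost leadership" notification for the old leader arrives.
    static boolean replayInterleaving() {
        registrations.clear();
        UUID oldLeader = UUID.randomUUID();
        UUID newLeader = UUID.randomUUID();

        registerJobManager("job-1", oldLeader);
        registerJobManager("job-1", newLeader);
        jobLeaderLostLeadership("job-1");

        // True only if a registration for the job is still present.
        return registrations.containsKey("job-1");
    }

    public static void main(String[] args) {
        System.out.println("new registration survives: " + replayInterleaving());
    }
}
```

With this ordering the map ends up empty, so the new leader's registration is 
lost even though it never lost leadership.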

As a possible solution, {{jobLeaderLostLeadership}} can be scheduled via 
{{runAsync}} in the rpc thread, so that no extra lock is needed for the 
map.
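
A rough sketch of that idea, using a single-threaded executor as a stand-in for 
the rpc thread and {{runAsync}}. All names here are hypothetical. The comparison 
against the old leader id is an additional assumption of this sketch and is not 
stated in the report: serializing access removes the need for a lock, but the 
removal still has to avoid discarding a newer registration when it runs late.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MainThreadRemovalSketch {

    // Stand-in for the rpc thread: every access to the map is funneled through it.
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();

    // Only ever touched from mainThread, so a plain HashMap needs no lock.
    private final Map<String, UUID> leaderByJob = new HashMap<>();

    void registerJobManager(String jobId, UUID leaderId) {
        mainThread.execute(() -> leaderByJob.put(jobId, leaderId));
    }

    // May be called from any thread; the actual work is handed to mainThread.
    void jobLeaderLostLeadership(String jobId, UUID oldLeaderId) {
        mainThread.execute(() -> {
            // Assumption of this sketch: remove only if the old leader is still registered.
            if (oldLeaderId.equals(leaderByJob.get(jobId))) {
                leaderByJob.remove(jobId);
            }
        });
    }

    // Reads the current leader on mainThread and waits for the result.
    UUID currentLeader(String jobId) {
        try {
            return mainThread.submit(() -> leaderByJob.get(jobId)).get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException("could not read current leader", e);
        }
    }

    void shutdown() {
        mainThread.shutdown();
    }

    // Replays the ordering from the report and checks that the new leader is kept.
    static boolean newLeaderSurvives() {
        MainThreadRemovalSketch rm = new MainThreadRemovalSketch();
        try {
            UUID oldLeader = UUID.randomUUID();
            UUID newLeader = UUID.randomUUID();
            rm.registerJobManager("job-1", oldLeader);
            rm.registerJobManager("job-1", newLeader);
            rm.jobLeaderLostLeadership("job-1", oldLeader); // late notification
            return newLeader.equals(rm.currentLeader("job-1"));
        } finally {
            rm.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("new registration survives: " + newLeaderSurvives());
    }
}
```

Because the executor runs tasks one at a time in submission order, the late 
notification sees the new leader's id, finds it differs from the old one, and 
leaves the registration in place.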



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
