[ https://issues.apache.org/jira/browse/FLINK-5893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhijiang reassigned FLINK-5893:
-------------------------------

    Assignee: zhijiang

> Race condition in removing previous JobManagerRegistration in ResourceManager
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-5893
>                 URL: https://issues.apache.org/jira/browse/FLINK-5893
>             Project: Flink
>          Issue Type: Bug
>          Components: ResourceManager
>            Reporter: zhijiang
>            Assignee: zhijiang
>
> The map of {{JobManagerRegistration}} entries in ResourceManager is not thread-safe,
> and currently two threads may operate on the map concurrently, which can
> produce unexpected results.
> The scenario is as follows:
> - {{registerJobManager}}: When the job leader changes and the new JobManager
> leader registers with the ResourceManager, the new {{JobManagerRegistration}}
> replaces the old one in the map under the same key, the {{JobID}}. This step
> runs in the RPC thread.
> - Meanwhile, the {{JobLeaderIdService}} in ResourceManager may detect the
> job leader change and trigger the action {{jobLeaderLostLeadership}} in
> another thread. This action removes the previous
> {{JobManagerRegistration}} from the map by {{JobID}}, but the old
> {{JobManagerRegistration}} may already have been replaced by the new one from
> {{registerJobManager}}.
> In summary, this race condition may cause the new {{JobManagerRegistration}}
> to be removed from ResourceManager, resulting in an exception when a slot is
> requested from the ResourceManager.
> As a possible solution, {{jobLeaderLostLeadership}} can be scheduled via
> {{runAsync}} in the RPC thread, so no extra lock is needed for the map.
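
The idea of running {{jobLeaderLostLeadership}} on the same thread as {{registerJobManager}} can be sketched roughly as below. This is a minimal illustration, not Flink's actual code: the class {{RegistrationSketch}}, the {{String}} job IDs, the {{UUID}} leader IDs, and the single-threaded {{ExecutorService}} standing in for the RPC thread and {{runAsync}} are all assumptions made for the example. The sketch also compares the leader ID before removing an entry. That check is an assumption of this example rather than something the issue states, and it only works without a lock because both operations are serialized on one thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for ResourceManager; not Flink's actual implementation. */
class RegistrationSketch {
    // Plain HashMap: safe only because every access happens on mainThread.
    private final Map<String, UUID> registrations = new HashMap<>();

    // Single-threaded executor playing the role of the RPC thread (assumption).
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();

    /** Analogue of registerJobManager: the new leader replaces the old entry. */
    Future<?> registerJobManager(String jobId, UUID leaderId) {
        return mainThread.submit(() -> {
            registrations.put(jobId, leaderId);
        });
    }

    /** Analogue of jobLeaderLostLeadership, scheduled onto the same thread. */
    Future<?> jobLeaderLostLeadership(String jobId, UUID oldLeaderId) {
        return mainThread.submit(() -> {
            // Assumption: remove the entry only if it still belongs to the old leader.
            registrations.remove(jobId, oldLeaderId);
        });
    }

    /** Reads the current leader on the same thread, so it sees all earlier tasks. */
    UUID currentLeader(String jobId) {
        Callable<UUID> read = () -> registrations.get(jobId);
        try {
            return mainThread.submit(read).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    void shutdown() {
        mainThread.shutdown();
    }

    public static void main(String[] args) {
        RegistrationSketch rm = new RegistrationSketch();
        UUID oldLeader = UUID.randomUUID();
        UUID newLeader = UUID.randomUUID();

        rm.registerJobManager("job-1", oldLeader);
        rm.registerJobManager("job-1", newLeader);      // leader change
        rm.jobLeaderLostLeadership("job-1", oldLeader); // late notification

        // The new registration survives the late notification.
        System.out.println(newLeader.equals(rm.currentLeader("job-1"))); // prints "true"
        rm.shutdown();
    }
}
```

Because the executor runs tasks one at a time in submission order, the check and the removal cannot interleave with a concurrent registration, which is why the map needs no additional lock in this sketch.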