[ 
https://issues.apache.org/jira/browse/FLINK-5893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-5893:
----------------------------
    Description: 
The map of {{JobManagerRegistration}} entries in ResourceManager is not 
thread-safe, and two threads can currently operate on it concurrently, 
producing unexpected results.

The scenario is as follows:

 - {{registerJobManager}}: When the job leader changes and the new JobManager 
leader registers with ResourceManager, the new {{JobManagerRegistration}} 
replaces the old one in the map under the same key, {{JobID}}. This is 
triggered by the rpc thread.

 - Meanwhile, the {{JobLeaderIdService}} in ResourceManager may notice the 
job leader change and trigger {{jobLeaderLostLeadership}} in another 
thread. This action removes the previous {{JobManagerRegistration}} 
from the map by {{JobID}}, but the old {{JobManagerRegistration}} may 
already have been replaced by the new one from {{registerJobManager}}.
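
The interleaving above can be replayed deterministically in a single thread. 
The following is a minimal sketch, not Flink code: the class name, the string 
job ID, and the string values standing in for {{JobManagerRegistration}} are 
all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the ResourceManager's registration map; not Flink code.
final class RegistrationRaceReplay {

    /** Replays the problematic interleaving sequentially and returns the final map. */
    static Map<String, String> replay() {
        Map<String, String> registrations = new HashMap<>();
        String jobId = "job-1";

        // Initial state: the old leader is registered.
        registrations.put(jobId, "old-leader");

        // Step 1 (rpc thread): the new leader registers and replaces the old entry.
        registrations.put(jobId, "new-leader");

        // Step 2 (other thread): the delayed "lost leadership" callback for the old
        // leader removes by job ID only, so it deletes the new leader's entry.
        registrations.remove(jobId);

        return registrations;
    }

    public static void main(String[] args) {
        // prints "false": the new registration is gone
        System.out.println(replay().containsKey("job-1"));
    }
}
```

The replay only shows the logical ordering problem. With real threads, the 
unsynchronized map can additionally be corrupted by the concurrent access 
itself.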

In summary, this race condition may cause the new {{JobManagerRegistration}} 
to be removed from ResourceManager, resulting in an exception when a slot is 
requested from ResourceManager. It occurs with low probability when running 
the JobManager failure ITCase.

As a solution, {{jobLeaderLostLeadership}} can be scheduled via {{runAsync}} 
in the rpc thread, so that no extra lock is needed for the map.
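
A minimal sketch of this idea, assuming a single-threaded executor as a 
stand-in for the rpc main thread. It is not the actual ResourceManager code: 
the class, its {{runAsync}} helper, and the use of a leader {{UUID}} as the map 
value are hypothetical. The sketch also makes one assumption beyond the 
description above: the removal compares the old leader ID, so a late callback 
cannot delete a newer registration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the ResourceManager; not Flink code.
final class SingleThreadedRegistry {

    // Only ever touched from mainThread, so a plain HashMap needs no lock.
    private final Map<String, UUID> leaderByJob = new HashMap<>();

    // Stand-in for the rpc main thread: tasks run one at a time, in order.
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();

    Future<?> runAsync(Runnable action) {
        return mainThread.submit(action);
    }

    Future<?> registerJobManager(String jobId, UUID leaderId) {
        return runAsync(() -> leaderByJob.put(jobId, leaderId));
    }

    Future<?> jobLeaderLostLeadership(String jobId, UUID oldLeaderId) {
        // Assumption: remove only if the entry still belongs to the old leader.
        return runAsync(() -> leaderByJob.remove(jobId, oldLeaderId));
    }

    UUID currentLeader(String jobId) throws Exception {
        Callable<UUID> read = () -> leaderByJob.get(jobId);
        return mainThread.submit(read).get();
    }

    void shutdown() {
        mainThread.shutdown();
    }

    /** Registers old and new leaders, then delivers the late callback for the old one. */
    static boolean newLeaderSurvivesLateCallback() {
        SingleThreadedRegistry registry = new SingleThreadedRegistry();
        UUID oldLeader = UUID.randomUUID();
        UUID newLeader = UUID.randomUUID();
        try {
            registry.registerJobManager("job-1", oldLeader);
            registry.registerJobManager("job-1", newLeader);
            registry.jobLeaderLostLeadership("job-1", oldLeader); // late callback
            return newLeader.equals(registry.currentLeader("job-1"));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            registry.shutdown();
        }
    }
}
```

Because every access to the map is funneled through one thread, the map needs 
no lock. The leader ID comparison is what keeps the late callback from 
removing the new entry. Without it, serializing the calls would fix the 
concurrent access but not the ordering problem.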

  was:
The map of {{JobManagerRegistration}} entries in ResourceManager is not 
thread-safe, and two threads can currently operate on it concurrently, 
producing unexpected results.

The scenario is as follows:

 - {{registerJobManager}}: When the job leader changes and the new JobManager 
leader registers with ResourceManager, the new {{JobManagerRegistration}} 
replaces the old one in the map under the same key, {{JobID}}. This is 
triggered by the rpc thread.

 - Meanwhile, the {{JobLeaderIdService}} in ResourceManager may notice the 
job leader change and trigger {{jobLeaderLostLeadership}} in another 
thread. This action removes the previous {{JobManagerRegistration}} 
from the map by {{JobID}}, but the old {{JobManagerRegistration}} may 
already have been replaced by the new one from {{registerJobManager}}.

In summary, this race condition may cause the new {{JobManagerRegistration}} 
to be removed from ResourceManager, resulting in an exception when a slot is 
requested from ResourceManager.

As a solution, {{jobLeaderLostLeadership}} can be scheduled via {{runAsync}} 
in the rpc thread, so that no extra lock is needed for the map.


> Race condition in removing previous JobManagerRegistration in ResourceManager
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-5893
>                 URL: https://issues.apache.org/jira/browse/FLINK-5893
>             Project: Flink
>          Issue Type: Bug
>          Components: ResourceManager
>            Reporter: zhijiang
>            Assignee: zhijiang
>
> The map of {{JobManagerRegistration}} entries in ResourceManager is not 
> thread-safe, and two threads can currently operate on it concurrently, 
> producing unexpected results.
> The scenario is as follows:
>  - {{registerJobManager}}: When the job leader changes and the new JobManager 
> leader registers with ResourceManager, the new {{JobManagerRegistration}} 
> replaces the old one in the map under the same key, {{JobID}}. This is 
> triggered by the rpc thread.
>  - Meanwhile, the {{JobLeaderIdService}} in ResourceManager may notice the 
> job leader change and trigger {{jobLeaderLostLeadership}} in another 
> thread. This action removes the previous {{JobManagerRegistration}} 
> from the map by {{JobID}}, but the old {{JobManagerRegistration}} may 
> already have been replaced by the new one from {{registerJobManager}}.
> In summary, this race condition may cause the new {{JobManagerRegistration}} 
> to be removed from ResourceManager, resulting in an exception when a slot is 
> requested from ResourceManager. It occurs with low probability when running 
> the JobManager failure ITCase.
> As a solution, {{jobLeaderLostLeadership}} can be scheduled via {{runAsync}} 
> in the rpc thread, so that no extra lock is needed for the map.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
