[jira] [Commented] (FLINK-12863) Race condition between slot offerings and AllocatedSlotReport

Till Rohrmann (JIRA) Mon, 17 Jun 2019 07:44:34 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-12863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865665#comment-16865665
 ]


Till Rohrmann commented on FLINK-12863:
---------------------------------------

Thanks for mentioning the heartbeat issue between TM and RM [~xiaogang.shi] and 
[~gaoyunhaii]. I suggest continuing the discussion of this problem on 
FLINK-12865.

[~tiemsn] I agree with you that keeping the fencing token up to date can be a 
bit of a hassle. One would need to include the fencing token in the heartbeat 
response from the TM to the JM.

The problem with an increasing version is overflow: what do you do once the 
counter wraps around?

Instead of going this way, I'm now in favor of removing the concurrency from 
the {{HeartbeatManager}}. All heartbeat calls would then be executed in the 
actor's main thread. If the heartbeat response were generated synchronously 
there, the race condition should be solved. I believe this should also solve 
FLINK-12865 without having to introduce the versions.
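
To illustrate the idea, here is a minimal sketch (not Flink code). It assumes 
a single-threaded executor as a stand-in for the actor's main thread; the 
class {{MainThreadHeartbeatSketch}} and its methods are hypothetical names 
made up for this example. Because slot offers and report generation run on 
the same thread, a report can never be built from a state that is older than 
an offer that was handled before it.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not Flink code: all names are made up for illustration.
final class MainThreadHeartbeatSketch {

    // Slots accepted so far; only ever touched by the single main thread.
    private final Set<String> allocatedSlots = new HashSet<>();

    // Stand-in for the actor's main thread: one thread, tasks run in submission order.
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();

    // Handles a slot offer on the main thread.
    Future<Boolean> offerSlot(String slotId) {
        Callable<Boolean> task = () -> allocatedSlots.add(slotId);
        return mainThread.submit(task);
    }

    // Builds the slot report on the main thread as well, so it is ordered with the offers.
    Future<Set<String>> requestHeartbeatReport() {
        Callable<Set<String>> task = () -> new HashSet<>(allocatedSlots);
        return mainThread.submit(task);
    }

    void shutdown() {
        mainThread.shutdown();
    }

    // Offers two slots, then requests a report; the report contains both slots.
    static Set<String> demo() {
        MainThreadHeartbeatSketch jobMaster = new MainThreadHeartbeatSketch();
        try {
            jobMaster.offerSlot("slot-1");
            jobMaster.offerSlot("slot-2");
            return jobMaster.requestHeartbeatReport().get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            jobMaster.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The sketch only shows the ordering argument; how the real 
{{HeartbeatManager}} would hand its calls over to the main thread is not 
covered here.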

> Race condition between slot offerings and AllocatedSlotReport
> -------------------------------------------------------------
>
>                 Key: FLINK-12863
>                 URL: https://issues.apache.org/jira/browse/FLINK-12863
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.7.3, 1.9.0, 1.8.1
>
>
> With FLINK-11059 we introduced the {{AllocatedSlotReport}} which is used by 
> the {{TaskExecutor}} to synchronize its internal view on slot allocations 
> with the view of the {{JobMaster}}. It seems that there is a race condition 
> between offering slots and receiving the report because the 
> {{AllocatedSlotReport}} is sent by the {{HeartbeatManagerSenderImpl}} from a 
> separate thread. 
> As a result, we can generate an {{AllocatedSlotReport}} just before new 
> slots are offered. Since the report is sent from a different thread, the 
> response to the slot offerings can then be sent earlier than the 
> {{AllocatedSlotReport}}. Consequently, we might receive an outdated slot 
> report on the {{TaskExecutor}}, causing active slots to be released.
> In order to solve the problem I propose to add a fencing token to the 
> {{AllocatedSlotReport}} which is updated whenever we offer new slots to 
> the {{JobMaster}}. When we receive the {{AllocatedSlotReport}} on the 
> {{TaskExecutor}} we compare the current slot report fencing token with the 
> received one and only process the report if they are equal. Otherwise we wait 
> for the next heartbeat to send us an up-to-date {{AllocatedSlotReport}}.
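
The fencing-token check proposed in the issue description above could look 
roughly like the following sketch. It is not Flink code: the class 
{{SlotReportFencingSketch}}, the nested {{Report}} type, and the use of a 
{{long}} counter as the token are all assumptions made for illustration. The 
sketch also assumes the token is handed to the {{JobMaster}} together with 
the slot offer, so that the {{JobMaster}} can echo it back in its report.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the proposed check; none of these names come from Flink.
final class SlotReportFencingSketch {

    // Hypothetical stand-in for the report: a token plus the slots the sender knows about.
    static final class Report {
        final long fencingToken;
        final Set<String> allocatedSlots;

        Report(long fencingToken, Set<String> allocatedSlots) {
            this.fencingToken = fencingToken;
            this.allocatedSlots = new HashSet<>(allocatedSlots);
        }
    }

    // Token of the most recent slot offer (a plain counter is an assumption).
    private long currentToken = 0L;

    // Slots this side currently considers active.
    private final Set<String> activeSlots = new HashSet<>();

    // Offering a slot changes the token, so reports built earlier no longer match.
    long offerSlot(String slotId) {
        activeSlots.add(slotId);
        currentToken++;
        return currentToken;
    }

    // Processes the report only if its token equals the current one.
    // Returns true if the report was applied, false if it was ignored as outdated.
    boolean onReport(Report report) {
        if (report.fencingToken != currentToken) {
            return false; // outdated report: wait for the next heartbeat
        }
        // Release slots that the sender of the report does not know about.
        activeSlots.retainAll(report.allocatedSlots);
        return true;
    }

    Set<String> activeSlots() {
        return new HashSet<>(activeSlots);
    }

    public static void main(String[] args) {
        SlotReportFencingSketch taskExecutor = new SlotReportFencingSketch();
        long token1 = taskExecutor.offerSlot("slot-1");
        // A report is generated now, but it is delivered only after the next offer.
        Report stale = new Report(token1, Set.of("slot-1"));
        taskExecutor.offerSlot("slot-2");
        System.out.println("stale report applied: " + taskExecutor.onReport(stale));
        System.out.println("active slots: " + taskExecutor.activeSlots());
    }
}
```

In this sketch the outdated report is dropped because its token no longer 
equals the current one, so the slot offered in the meantime stays active.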



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
