Till Rohrmann created FLINK-12863:
-------------------------------------

             Summary: Race condition between slot offerings and AllocatedSlotReport
                 Key: FLINK-12863
                 URL: https://issues.apache.org/jira/browse/FLINK-12863
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.9.0
            Reporter: Till Rohrmann
             Fix For: 1.9.0


With FLINK-11059 we introduced the {{AllocatedSlotReport}}, which the
{{TaskExecutor}} uses to synchronize its internal view of slot allocations with
the view of the {{JobMaster}}. There appears to be a race condition between
offering slots and receiving the report, because the {{AllocatedSlotReport}} is
sent by the {{HeartbeatManagerSenderImpl}} from a separate thread.

As a result, the {{JobMaster}} can generate an {{AllocatedSlotReport}} just
before new slots are offered to it. Since the report is sent from a different
thread, the response to the slot offering can be sent before the
{{AllocatedSlotReport}}. Consequently, the {{TaskExecutor}} might receive an
outdated slot report, causing active slots to be released.

To solve the problem, I propose adding a fencing token to the
{{AllocatedSlotReport}} that is updated whenever the {{TaskExecutor}} offers new
slots to the {{JobMaster}}. When the {{TaskExecutor}} receives an
{{AllocatedSlotReport}}, it compares its current slot report fencing token with
the received one and processes the report only if they are equal. Otherwise, it
waits for the next heartbeat to deliver an up-to-date {{AllocatedSlotReport}}.
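
The proposed check could look roughly like the sketch below. It is only an
illustration, not the actual Flink change: {{SlotReportFencingSketch}},
{{FencedSlotReport}}, {{TaskExecutorSlots}} and their methods are hypothetical
names, slot IDs are plain strings, and the sketch assumes the {{JobMaster}}
echoes back the token that came with the last slot offer it processed.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the fencing idea; none of these types are Flink APIs. */
final class SlotReportFencingSketch {

    /** A slot report tagged with the fencing token known when it was generated. */
    static final class FencedSlotReport {
        final long fencingToken;
        final Set<String> allocatedSlotIds;

        FencedSlotReport(long fencingToken, Set<String> allocatedSlotIds) {
            this.fencingToken = fencingToken;
            this.allocatedSlotIds = new HashSet<>(allocatedSlotIds);
        }
    }

    /** TaskExecutor-side bookkeeping of active slots and the current token. */
    static final class TaskExecutorSlots {
        private long currentToken = 0L;
        private final Set<String> activeSlotIds = new HashSet<>();

        /**
         * Records a slot offer and advances the fencing token. The returned
         * token is assumed to travel with the offer to the JobMaster.
         */
        long offerSlots(Set<String> slotIds) {
            activeSlotIds.addAll(slotIds);
            currentToken++;
            return currentToken;
        }

        /**
         * Processes a report only if its token equals the current one.
         * Returns false for a stale report, which is ignored; the next
         * heartbeat is expected to carry an up-to-date report.
         */
        boolean handleReport(FencedSlotReport report) {
            if (report.fencingToken != currentToken) {
                return false;
            }
            // Release every active slot the JobMaster does not list as allocated.
            activeSlotIds.retainAll(report.allocatedSlotIds);
            return true;
        }

        /** Returns a copy of the currently active slot IDs. */
        Set<String> activeSlots() {
            return new HashSet<>(activeSlotIds);
        }
    }
}
```

In this sketch, a report created before the second call to {{offerSlots}}
carries the older token, so {{handleReport}} returns false and the newly
offered slot stays active instead of being released.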



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
