Zhu Zhu created FLINK-14701:
-------------------------------

             Summary: Slot leaks if SharedSlotOversubscribedException happens
                 Key: FLINK-14701
                 URL: https://issues.apache.org/jira/browse/FLINK-14701
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.10.0, 1.9.2
            Reporter: Zhu Zhu
             Fix For: 1.10.0, 1.9.2


If a {{SharedSlotOversubscribedException}} happens, the {{MultiTaskSlot}} will 
release some of its child {{SingleTaskSlot}}s. Releasing a child triggers a 
re-allocation of the task slot right inside {{SingleTaskSlot#release(...)}}. 
As a result, the previous allocation in {{SlotSharingManager#allTaskSlots}} is 
replaced by the new allocation, because both share the same {{slotRequestId}}.
However, {{SingleTaskSlot#release(...)}} then invokes 
{{MultiTaskSlot#releaseChild}} to release the previous allocation by its 
{{slotRequestId}}, which unexpectedly removes the new allocation from the 
{{SlotSharingManager}}.
The slot then leaks, because the pending slot request is no longer tracked 
by the {{SlotSharingManager}} and cannot be released when its payload terminates.
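
The sequence above can be replayed with a plain map keyed by request id. This 
is only a simplified sketch, not the actual Flink code: {{SlotLeakSketch}}, 
{{Slot}} and the id {{"request-1"}} are hypothetical stand-ins, and the map 
merely models {{allTaskSlots}}.

```java
import java.util.HashMap;
import java.util.Map;

public class SlotLeakSketch {

    /** Hypothetical stand-in for a task slot; only identity and request id matter here. */
    static final class Slot {
        final String slotRequestId;

        Slot(String slotRequestId) {
            this.slotRequestId = slotRequestId;
        }
    }

    /** Replays the problematic sequence and reports whether the new allocation is still tracked. */
    static boolean newAllocationTrackedAfterRelease() {
        // Models allTaskSlots: slotRequestId -> task slot.
        Map<String, Slot> allTaskSlots = new HashMap<>();

        Slot previous = new Slot("request-1");
        allTaskSlots.put(previous.slotRequestId, previous);

        // Step 1: releasing the previous slot triggers a re-allocation that
        // reuses the same slotRequestId and therefore replaces the map entry.
        Slot reallocated = new Slot("request-1");
        allTaskSlots.put(reallocated.slotRequestId, reallocated);

        // Step 2: the release then removes "its" entry by id only, which by now
        // points at the new allocation.
        allTaskSlots.remove(previous.slotRequestId);

        return allTaskSlots.get("request-1") == reallocated;
    }

    public static void main(String[] args) {
        // Prints "false": the new allocation is no longer tracked.
        System.out.println(newAllocationTrackedAfterRelease());
    }
}
```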

A test case {{testNoSlotLeakOnSharedSlotOversubscribedException}} which 
exhibits this issue can be found in this 
[commit|https://github.com/zhuzhurk/flink/commit/9024e2e9eb4bd17f371896d6dbc745bc9e585e14].

The slot leak blocks the TPC-DS queries on Flink 1.10, see FLINK-14674.

To solve it, I'd propose to make {{MultiTaskSlot#releaseChild}} stricter so that 
it only removes its true child task slot from the {{SlotSharingManager}}, i.e. add a 
check {{if (child == allTaskSlots.get(child.getSlotRequestId()))}} before 
invoking {{allTaskSlots.remove(child.getSlotRequestId())}}.
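
A minimal sketch of the proposed identity check, assuming a plain map in place 
of {{allTaskSlots}}. {{GuardedReleaseSketch}}, {{Slot}} and the static 
{{releaseChild}} helper are hypothetical names for illustration, not the real 
Flink signatures.

```java
import java.util.HashMap;
import java.util.Map;

public class GuardedReleaseSketch {

    /** Hypothetical stand-in for a task slot; only identity and request id matter here. */
    static final class Slot {
        final String slotRequestId;

        Slot(String slotRequestId) {
            this.slotRequestId = slotRequestId;
        }
    }

    /**
     * Removes the child only if the map still holds this exact instance under
     * its request id. Returns true if an entry was removed.
     */
    static boolean releaseChild(Map<String, Slot> allTaskSlots, Slot child) {
        if (child == allTaskSlots.get(child.slotRequestId)) {
            allTaskSlots.remove(child.slotRequestId);
            return true;
        }
        // A newer allocation owns the id now; leave it tracked.
        return false;
    }

    public static void main(String[] args) {
        Map<String, Slot> allTaskSlots = new HashMap<>();
        Slot previous = new Slot("request-1");
        Slot reallocated = new Slot("request-1");
        allTaskSlots.put(previous.slotRequestId, previous);
        allTaskSlots.put(reallocated.slotRequestId, reallocated);

        // Releasing the stale child is a no-op, so the new allocation stays tracked.
        System.out.println(releaseChild(allTaskSlots, previous));
        System.out.println(allTaskSlots.get("request-1") == reallocated);
    }
}
```

With this guard, releasing the stale child leaves the map untouched, so the new 
allocation can still be found and released when its payload terminates.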




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
