[ https://issues.apache.org/jira/browse/FLINK-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021656#comment-17021656 ]
Hequn Cheng commented on FLINK-14701:
-------------------------------------

[~zhuzh] Hi, could I move this issue to 1.9.3 as it is not critical? The release of 1.9.2 is very close. :)

> Slot leaks if SharedSlotOversubscribedException happens
> -------------------------------------------------------
>
>                 Key: FLINK-14701
>                 URL: https://issues.apache.org/jira/browse/FLINK-14701
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.2
>            Reporter: Zhu Zhu
>            Assignee: Zhu Zhu
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.9.2
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> If a {{SharedSlotOversubscribedException}} happens, the {{MultiTaskSlot}}
> will release some of its child {{SingleTaskSlot}}s. This release triggers
> a re-allocation of the task slot right inside
> {{SingleTaskSlot#release(...)}}. As a result, the previous allocation in
> {{SlotSharingManager#allTaskSlots}} is replaced by the new allocation,
> because both share the same {{slotRequestId}}.
> However, {{SingleTaskSlot#release(...)}} then invokes
> {{MultiTaskSlot#releaseChild}} to release the previous allocation by its
> {{slotRequestId}}, which unexpectedly removes the new allocation from the
> {{SlotSharingManager}}.
> The slot leaks because the pending slot request is no longer tracked by
> the {{SlotSharingManager}} and cannot be released when its payload
> terminates.
> A test case {{testNoSlotLeakOnSharedSlotOversubscribedException}} which
> exhibits this issue can be found in this
> [commit|https://github.com/zhuzhurk/flink/commit/9024e2e9eb4bd17f371896d6dbc745bc9e585e14].
> The slot leak blocks the TPC-DS queries on Flink 1.10, see FLINK-14674.
> To solve it, I'd propose to strengthen {{MultiTaskSlot#releaseChild}} to
> only remove its true child task slot from the {{SlotSharingManager}}, i.e.
> add a check {{if (child == allTaskSlots.get(child.getSlotRequestId()))}}
> before invoking {{allTaskSlots.remove(child.getSlotRequestId())}}.
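
For illustration only, the proposed identity check can be sketched in a few lines of Java. This is a simplified model under stated assumptions, not Flink's actual code: {{ReleaseChildSketch}}, {{TaskSlot}}, {{releaseChildUnguarded}}, {{releaseChildGuarded}} and {{newAllocationSurvives}} are hypothetical names, and a plain {{HashMap}} keyed by a string stands in for {{allTaskSlots}}. It replays the sequence described in the issue (re-allocation under the same {{slotRequestId}}, then release of the previous slot) with and without the check.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the slot bookkeeping described in the
// issue; it is not Flink's actual SlotSharingManager.
final class ReleaseChildSketch {

    /** Minimal stand-in for a task slot, identified by its slot request id. */
    static final class TaskSlot {
        private final String slotRequestId;

        TaskSlot(String slotRequestId) {
            this.slotRequestId = slotRequestId;
        }

        String getSlotRequestId() {
            return slotRequestId;
        }
    }

    /** Maps a slot request id to the slot currently tracked under that id. */
    final Map<String, TaskSlot> allTaskSlots = new HashMap<>();

    /** Unguarded removal: drops whatever is registered under the child's id. */
    void releaseChildUnguarded(TaskSlot child) {
        allTaskSlots.remove(child.getSlotRequestId());
    }

    /** Guarded removal: drops the entry only if it is this exact child. */
    void releaseChildGuarded(TaskSlot child) {
        if (child == allTaskSlots.get(child.getSlotRequestId())) {
            allTaskSlots.remove(child.getSlotRequestId());
        }
    }

    /** Replays release plus re-allocation; true if the new slot is still tracked. */
    static boolean newAllocationSurvives(boolean guarded) {
        ReleaseChildSketch manager = new ReleaseChildSketch();
        TaskSlot previous = new TaskSlot("request-1");
        manager.allTaskSlots.put(previous.getSlotRequestId(), previous);

        // The re-allocation triggered during release registers a new slot
        // under the same request id, replacing the previous map entry.
        TaskSlot replacement = new TaskSlot("request-1");
        manager.allTaskSlots.put(replacement.getSlotRequestId(), replacement);

        // The release of the previous slot then continues.
        if (guarded) {
            manager.releaseChildGuarded(previous);
        } else {
            manager.releaseChildUnguarded(previous);
        }
        return manager.allTaskSlots.get("request-1") == replacement;
    }

    public static void main(String[] args) {
        System.out.println("unguarded keeps new allocation: " + newAllocationSurvives(false));
        System.out.println("guarded keeps new allocation: " + newAllocationSurvives(true));
    }
}
```

In this model the unguarded variant removes the replacement entry, which corresponds to the leak, while the guarded variant leaves it tracked. {{java.util.Map#remove(Object key, Object value)}} expresses a similar conditional removal, but it compares values with {{equals}} rather than {{==}}.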