[ https://issues.apache.org/jira/browse/FLINK-19142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhu Zhu reopened FLINK-19142: ----------------------------- > Local recovery can be broken if slot hijacking happened during a full restart > ----------------------------------------------------------------------------- > > Key: FLINK-19142 > URL: https://issues.apache.org/jira/browse/FLINK-19142 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination > Affects Versions: 1.12.0 > Reporter: Andrey Zagrebin > Assignee: Zhu Zhu > Priority: Major > Labels: pull-request-available > Fix For: 1.15.0 > > > The ticket originates from [this PR > discussion|https://github.com/apache/flink/pull/13181#discussion_r481087221]. > The previous AllocationIDs are used by > PreviousAllocationSlotSelectionStrategy to schedule subtasks into the slot > where they were previously executed before a failover. If the previous slot > (AllocationID) is not available, we do not want subtasks to take previous > slots (AllocationIDs) of other subtasks. > The MergingSharedSlotProfileRetriever gets all previous AllocationIDs of the > bulk from SlotSharingExecutionSlotAllocator but only from the current bulk. > The previous AllocationIDs of other bulks stay unknown. Therefore, the > current bulk can potentially hijack the previous slots from the preceding > bulks. On the other hand the previous AllocationIDs of other tasks should be > taken if the other tasks are not going to run at the same time, e.g. not > enough resources after failover or other bulks are done. > Local recovery can be broken due to this. e.g. when multiple regions of a > streaming job are restarted at the same time(due to global failover, or task > failover with `full` failover strategy). -- This message was sent by Atlassian Jira (v8.3.4#803005)