[ https://issues.apache.org/jira/browse/FLINK-19142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455105#comment-17455105 ]
Till Rohrmann commented on FLINK-19142:
---------------------------------------

[~zhuzh] shall we do the backport?

> Local recovery can be broken if slot hijacking happened during a full restart
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-19142
>                 URL: https://issues.apache.org/jira/browse/FLINK-19142
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.0
>            Reporter: Andrey Zagrebin
>            Assignee: Zhu Zhu
>            Priority: Major
>              Labels: pull-request-available, stale-assigned
>             Fix For: 1.15.0, 1.14.1
>
>
> The ticket originates from [this PR
> discussion|https://github.com/apache/flink/pull/13181#discussion_r481087221].
> The previous AllocationIDs are used by
> PreviousAllocationSlotSelectionStrategy to schedule subtasks into the slot
> where they were previously executed before a failover. If the previous slot
> (AllocationID) is not available, we do not want subtasks to take previous
> slots (AllocationIDs) of other subtasks.
> The MergingSharedSlotProfileRetriever gets all previous AllocationIDs of the
> bulk from SlotSharingExecutionSlotAllocator but only from the current bulk.
> The previous AllocationIDs of other bulks stay unknown. Therefore, the
> current bulk can potentially hijack the previous slots from the preceding
> bulks. On the other hand the previous AllocationIDs of other tasks should be
> taken if the other tasks are not going to run at the same time, e.g. not
> enough resources after failover or other bulks are done.
> Local recovery can be broken due to this. e.g. when multiple regions of a
> streaming job are restarted at the same time (due to global failover, or task
> failover with `full` failover strategy).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
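
For illustration, the hijacking scenario described in the ticket can be sketched as a small, simplified model. This is not Flink code: `select_slot`, `schedule`, and the `reserve_across_bulks` flag are hypothetical names invented for this sketch, AllocationIDs are modelled as plain strings, and Python is used only for brevity. The sketch assumes each subtask first tries its own previous slot and otherwise avoids slots that the scheduler knows belong to other subtasks.

```python
from typing import Dict, List, Optional, Set


def select_slot(own_previous: Optional[str],
                free_slots: List[str],
                reserved: Set[str]) -> Optional[str]:
    """Pick a slot for one subtask (hypothetical helper, not a Flink API)."""
    # Prefer the slot the subtask occupied before the failover.
    if own_previous is not None and own_previous in free_slots:
        return own_previous
    # Otherwise avoid slots known to be previous slots of other subtasks.
    for slot in free_slots:
        if slot not in reserved:
            return slot
    # Nothing else is left: take any free slot rather than stay unscheduled.
    return free_slots[0] if free_slots else None


def schedule(bulks: List[List[str]],
             previous: Dict[str, str],
             slots: List[str],
             reserve_across_bulks: bool) -> Dict[str, str]:
    """Assign slots bulk by bulk; returns subtask -> slot."""
    free = list(slots)
    assignment: Dict[str, str] = {}
    # Previous slots of every subtask that is restarted together.
    all_previous = {previous[t] for bulk in bulks for t in bulk if t in previous}
    for bulk in bulks:
        if reserve_across_bulks:
            known = all_previous
        else:
            # Behaviour described in the ticket: only the current bulk is visible.
            known = {previous[t] for t in bulk if t in previous}
        for task in bulk:
            own = previous.get(task)
            chosen = select_slot(own, free, known - {own})
            if chosen is not None:
                free.remove(chosen)
                assignment[task] = chosen
    return assignment


# Subtask A's previous slot "a0" was lost; subtask B previously ran in "a1".
bulks = [["A"], ["B"]]
previous = {"A": "a0", "B": "a1"}
slots = ["a1", "a2", "a3"]

hijacked = schedule(bulks, previous, slots, reserve_across_bulks=False)
# {'A': 'a1', 'B': 'a2'}: A took B's previous slot, so B cannot recover locally.

protected = schedule(bulks, previous, slots, reserve_across_bulks=True)
# {'A': 'a2', 'B': 'a1'}: B returns to its previous slot.
```

In this model the only difference between the two runs is whether the previous AllocationIDs of the other bulk are visible when the first bulk picks its slot. The final fallback in `select_slot` reflects the ticket's remark that previous slots of other tasks may still be taken when nothing else is available.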