[ https://issues.apache.org/jira/browse/FLINK-19142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456104#comment-17456104 ]
Zhu Zhu commented on FLINK-19142:
---------------------------------

1.14:
63cf221ca2d963aa1394fb0244a8702b9e8c3835
347becbe43209fb9c65bcc8ae3859a071469e587

> Local recovery can be broken if slot hijacking happened during a full restart
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-19142
>                 URL: https://issues.apache.org/jira/browse/FLINK-19142
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.0
>            Reporter: Andrey Zagrebin
>            Assignee: Zhu Zhu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0, 1.14.1
>
>
> The ticket originates from [this PR
> discussion|https://github.com/apache/flink/pull/13181#discussion_r481087221].
>
> The previous AllocationIDs are used by
> PreviousAllocationSlotSelectionStrategy to schedule subtasks into the slots
> where they were executed before a failover. If a subtask's previous slot
> (AllocationID) is not available, we do not want it to take the previous
> slots (AllocationIDs) of other subtasks.
>
> The MergingSharedSlotProfileRetriever gets the previous AllocationIDs from
> SlotSharingExecutionSlotAllocator, but only for the current bulk. The
> previous AllocationIDs of other bulks stay unknown. Therefore, the current
> bulk can potentially hijack the previous slots of the preceding bulks. On
> the other hand, the previous AllocationIDs of other tasks should be taken if
> those tasks are not going to run at the same time, e.g. because there are
> not enough resources after the failover or because the other bulks are done.
>
> Local recovery can be broken due to this, e.g. when multiple regions of a
> streaming job are restarted at the same time (due to a global failover, or a
> task failover with the `full` failover strategy).

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
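
The hijacking scenario described in the ticket can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class SlotHijackSketch, its methods, and the "reserved" set are hypothetical names invented for illustration, slots are plain strings rather than Flink AllocationIDs, and none of this is the code from the commits listed above. It shows a subtask A whose previous slot s1 is lost and a subtask B whose previous slot s2 is still free, with A scheduled first.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Toy model of previous-allocation-aware slot selection. Hypothetical, not Flink's classes. */
final class SlotHijackSketch {

    /**
     * Picks a slot for one subtask.
     *
     * @param available free slots, identified by a string standing in for an AllocationID
     * @param previous  the slot the subtask ran in before the failover, or null
     * @param reserved  previous slots of other subtasks known to be restarting as well
     */
    static Optional<String> select(Set<String> available, String previous, Set<String> reserved) {
        // Prefer the subtask's own previous slot so that local state can be reused.
        if (previous != null && available.contains(previous)) {
            return Optional.of(previous);
        }
        // Otherwise take the first free slot that no other restarting subtask is waiting for.
        for (String slot : available) {
            if (!reserved.contains(slot)) {
                return Optional.of(slot);
            }
        }
        // Only reserved slots remain; what to do then is left open in this sketch.
        return Optional.empty();
    }

    /** Schedules A (previous slot s1, lost) before B (previous slot s2) and returns {slotOfA, slotOfB}. */
    static String[] restartBoth(Set<String> reservedSeenByA) {
        Set<String> free = new LinkedHashSet<>(List.of("s2", "s3"));
        String a = select(free, "s1", reservedSeenByA).orElseThrow();
        free.remove(a);
        String b = select(free, "s2", Set.of()).orElseThrow();
        return new String[] {a, b};
    }

    public static void main(String[] args) {
        // A knows nothing about B's previous slot: A takes s2, B is pushed to s3.
        System.out.println(Arrays.toString(restartBoth(Set.of())));     // [s2, s3]
        // A is told that s2 is reserved for B: A takes s3, B recovers locally in s2.
        System.out.println(Arrays.toString(restartBoth(Set.of("s2")))); // [s3, s2]
    }
}
```

In the first call, A only sees the previous allocations of its own bulk, so it hijacks s2 and B loses its local state. In the second call, A is additionally told about the previous slots of the other restarting subtask, which matches the behaviour the ticket asks for. Slots of tasks that are not restarting at the same time would simply be left out of the reserved set and stay eligible.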