[ https://issues.apache.org/jira/browse/FLINK-14607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhu Zhu updated FLINK-14607:
----------------------------
    Description:
Currently, a pending slot request can only be fulfilled when a physical slot ({{AllocatedSlot}}) becomes available in {{SlotPool}}.
A shared slot, however, cannot be used to fulfill pending requests even if it becomes qualified. This may lead to resource deadlocks in certain cases.

For example, when running job A(parallelism=2) --(pipelined)--> B(parallelism=2) with only 1 slot and all vertices in the same slot sharing group, the following may happen:
 1. Schedule A1 and A2. A1 acquires the only slot. A2's slot request stays pending because a slot cannot host 2 instances of the same JobVertex at the same time. Shared slot status: \{A1\}
 2. A1 produces data and triggers the scheduling of B1. Shared slot status: {A1, B1}
 3. A1 finishes. Shared slot status: {B1}
 4. B1 cannot finish since A2 has not finished, while A2 cannot be launched because no physical slot becomes available, even though the shared slot is now qualified to host it. A resource deadlock happens.

Maybe we should improve {{SlotSharingManager}}. Once a task slot is released, its root {{MultiTaskSlot}} should be used to try fulfilling existing pending task slots from other pending root slots ({{unresolvedRootSlots}}) in this {{SlotSharingManager}} (that is, in the same slot sharing group).
We need to be careful not to cause any failures and not to violate colocation constraints.

cc [~trohrmann]

> SharedSlot cannot fulfill pending slot requests before it's completely released
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-14607
>                 URL: https://issues.apache.org/jira/browse/FLINK-14607
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0, 1.9.1
>            Reporter: Zhu Zhu
>            Priority: Major
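
The four-step scenario in the description can be reproduced with a small toy model. The sketch below is only an illustration under stated assumptions: {{SharedSlotSketch}}, {{request}}, {{finish}}, {{deadlocks}} and the {{retryPendingOnTaskRelease}} flag are hypothetical names, not Flink APIs, and the model covers a single shared slot without colocation constraints. With the flag off, pending requests are retried only once the slot is completely empty (standing in for the physical slot returning to {{SlotPool}}); with the flag on, they are retried whenever any task leaves the shared slot, which is roughly the improvement suggested in the description.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Toy model of ONE shared slot. Hypothetical sketch, not Flink's SlotSharingManager. */
class SharedSlotSketch {
    // Job vertex id -> subtask currently hosted, e.g. "A" -> "A1".
    private final Map<String, String> hosted = new HashMap<>();
    // Subtasks whose slot requests could not be fulfilled yet.
    private final Deque<String> pending = new ArrayDeque<>();
    // Hypothetical switch: retry pending requests whenever a task leaves the slot.
    private final boolean retryPendingOnTaskRelease;

    SharedSlotSketch(boolean retryPendingOnTaskRelease) {
        this.retryPendingOnTaskRelease = retryPendingOnTaskRelease;
    }

    // Assumes names like "A1": a one-letter vertex id followed by the subtask index.
    private static String vertexOf(String task) {
        return task.substring(0, 1);
    }

    /** Returns true if the task is launched, false if its request is queued as pending. */
    boolean request(String task) {
        if (hosted.containsKey(vertexOf(task))) {
            // A slot cannot host two subtasks of the same job vertex at the same time.
            pending.add(task);
            return false;
        }
        hosted.put(vertexOf(task), task);
        return true;
    }

    /** Removes a finished task from the slot and returns the pending tasks launched as a result. */
    List<String> finish(String task) {
        hosted.remove(vertexOf(task), task);
        List<String> launched = new ArrayList<>();
        // Without the flag, pending requests are retried only when the slot is completely
        // empty, which models the physical slot becoming available again.
        if (retryPendingOnTaskRelease || hosted.isEmpty()) {
            Iterator<String> it = pending.iterator();
            while (it.hasNext()) {
                String candidate = it.next();
                if (!hosted.containsKey(vertexOf(candidate))) {
                    hosted.put(vertexOf(candidate), candidate);
                    it.remove();
                    launched.add(candidate);
                }
            }
        }
        return launched;
    }

    /** Replays steps 1-3 of the description and reports whether A2 is still stuck afterwards. */
    static boolean deadlocks(boolean retryPendingOnTaskRelease) {
        SharedSlotSketch slot = new SharedSlotSketch(retryPendingOnTaskRelease);
        slot.request("A1"); // step 1: A1 acquires the only slot
        slot.request("A2"); // step 1: A2 stays pending (vertex A is already hosted)
        slot.request("B1"); // step 2: B1 joins the shared slot
        slot.finish("A1");  // step 3: A1 finishes, B1 remains
        // Step 4: B1 needs A2's output, so a still-pending A2 means a deadlock.
        return slot.pending.contains("A2");
    }

    public static void main(String[] args) {
        System.out.println("retry only on empty slot -> deadlock: " + deadlocks(false));
        System.out.println("retry on task release    -> deadlock: " + deadlocks(true));
    }
}
```

In this model the first run leaves A2 pending while B1 still occupies the slot, and the second run launches A2 as soon as A1 finishes. A real change would also have to respect colocation constraints and failure handling, which the sketch deliberately leaves out.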