[ https://issues.apache.org/jira/browse/FLINK-14607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967449#comment-16967449 ]
Zhu Zhu edited comment on FLINK-14607 at 11/5/19 11:53 AM:
-----------------------------------------------------------

With FLIP-53, it may happen that a *logical region* in a batch job, if it contains *PIPELINED* edges, runs with fewer slots than its parallelism. It can be worse if a job has multiple such *logical regions*. However, it is not a problem for streaming jobs, nor for batch jobs whose edges are all BLOCKING. I am also concerned that this is not easy work and could introduce severe bugs, so I think it is fine to de-prioritize it if we have other ways to avoid the issue in upcoming versions.

If I understand correctly, you mean to batch schedule a pipelined region, just like what we do for streaming jobs? If so, I think we are almost there, except for a new scheduling strategy which always schedules a whole region at a time. It is promising to have it in 1.11.

was (Author: zhuzh):
With FLIP-53, it may happen that a *logical region* in a batch job, if it contains *PIPELINED* edges, runs with fewer slots than its parallelism. It can be worse if a job has multiple such *logical regions*. However, it is not a problem for streaming jobs, nor for batch jobs whose edges are all BLOCKING. I am also concerned that this is not easy work and could introduce severe bugs, so I think it is fine to de-prioritize it if we have other ways to avoid the issue in upcoming versions.

> SharedSlot cannot fulfill pending slot requests before it's completely released
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-14607
>                 URL: https://issues.apache.org/jira/browse/FLINK-14607
>             Project: Flink
>          Issue Type: Bug
>      Components: Runtime / Coordination
>    Affects Versions: 1.10.0, 1.9.1
>            Reporter: Zhu Zhu
>            Priority: Major
>
> Currently a pending request can only be fulfilled when a physical slot ({{AllocatedSlot}}) becomes available in the {{SlotPool}}.
> A shared slot, however, cannot be used to fulfill pending requests even if it becomes qualified. This may lead to resource deadlocks in certain cases.
> For example, run a job A (parallelism=2) --(pipelined)--> B (parallelism=2) with only 1 slot, with all vertices in the same slot sharing group. Here's what may happen:
> 1. Schedule A1 and A2. A1 acquires the only slot; A2's slot request is pending because a slot cannot host 2 instances of the same JobVertex at the same time. Shared slot status: \{A1\}
> 2. A1 produces data and triggers the scheduling of B1. Shared slot status: \{A1, B1\}
> 3. A1 finishes. Shared slot status: \{B1\}
> 4. B1 cannot finish since A2 has not finished, while A2 cannot get launched because no physical slot becomes available, even though the shared slot is qualified to host it now. A resource deadlock happens.
> Maybe we should improve {{SlotSharingManager}}. Once a task slot is released, its root {{MultiTaskSlot}} should be used to try fulfilling existing pending task slots from other pending root slots ({{unresolvedRootSlots}}) in the same {{SlotSharingManager}} (i.e. in the same slot sharing group).
> We need to be careful not to cause any failures, and not to violate colocation constraints.
> cc [~trohrmann]

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
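The release-time re-check proposed in the description can be illustrated with a toy model. This is a hedged sketch, not Flink's actual {{SlotSharingManager}} API: the class names, the single-letter JobVertex encoding ("A2" belongs to vertex "A"), and the pending-request queue are all invented for illustration. It shows why re-examining queued requests when a task slot is released (rather than only when a new physical slot arrives from the {{SlotPool}}) lets A2 run and breaks the deadlock in step 4 above:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Queue;
import java.util.Set;

/** Toy model of a shared slot that re-checks pending requests on task release. */
class SharedSlotModel {
    // JobVertices currently occupying the shared slot ("A", "B", ...).
    private final Set<String> occupiedVertices = new HashSet<>();
    // Requests that could not be fulfilled immediately (e.g. A2 while A1 runs).
    private final Queue<String> pendingRequests = new ArrayDeque<>();

    // A slot may host at most one subtask per JobVertex; "A2" maps to vertex "A".
    private static String vertexOf(String subtask) {
        return subtask.substring(0, 1);
    }

    /** Returns true if fulfilled immediately, false if queued as pending. */
    boolean request(String subtask) {
        if (occupiedVertices.add(vertexOf(subtask))) {
            return true;
        }
        pendingRequests.add(subtask);
        return false;
    }

    /**
     * The proposed improvement: on release, try to fulfill queued requests
     * from the same slot sharing group instead of waiting for a brand-new
     * physical slot to become available.
     */
    void release(String subtask) {
        occupiedVertices.remove(vertexOf(subtask));
        for (Iterator<String> it = pendingRequests.iterator(); it.hasNext(); ) {
            String pending = it.next();
            if (occupiedVertices.add(vertexOf(pending))) {
                it.remove();
                System.out.println("fulfilled pending request: " + pending);
            }
        }
    }

    Set<String> status() {
        return occupiedVertices;
    }
}

public class DeadlockSketch {
    public static void main(String[] args) {
        SharedSlotModel slot = new SharedSlotModel();
        slot.request("A1"); // step 1: A1 occupies the only slot
        slot.request("A2"); // queued: same JobVertex as A1
        slot.request("B1"); // step 2: B1 fits alongside A1
        slot.release("A1"); // step 3: with the fix, A2 is fulfilled here
        System.out.println(slot.status());
    }
}
```

Without the re-check inside release(), the pending A2 request would stay queued forever in this model, mirroring the deadlock in the description.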