[jira] [Comment Edited] (FLINK-17330) Avoid scheduling deadlocks caused by cyclic input dependencies between regions

Till Rohrmann (Jira) Fri, 24 Apr 2020 01:07:39 -0700


    [ https://issues.apache.org/jira/browse/FLINK-17330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091315#comment-17091315 ]


Till Rohrmann edited comment on FLINK-17330 at 4/24/20, 8:06 AM:
-----------------------------------------------------------------

>> I think it's possible but it may be hard for users to identify whether there 
>> are cyclic dependencies. Most users will have to choose the mode to set all 
>> edges BLOCKING to be safe and lose the benefit of pipelined region 
>> scheduling. So if we'd like to take it this way, I think it's better we do 
>> it automatically for users, i.e. override GlobalDataExchangeMode to be 
>> ALL_EDGES_BLOCKING if cyclic dependency is detected.

I would actually suggest throwing an exception with a hint to set 
{{GlobalDataExchangeMode}} to {{ALL_EDGES_BLOCKING}}. It is not a perfect 
solution, but it reduces complexity because we don't have to implement some 
magic which might be surprising and won't last very long. It also depends a bit 
on how difficult/complex it would be to implement such an automatic fallback.

I think the main question is whether we consider this feature to be required 
for the MVP or not. I believe that even with this limitation we will add 
value for our users because in many cases they won't be affected. If they are 
affected, then they have clear instructions on how to work around the problem. 
Moreover, we might actually be able to solve this problem after the MVP has 
been completed and before the release in which the MVP will be shipped. That 
way nobody will be affected. If we do it the other way around (fixing this 
problem before the MVP counts as complete), we risk missing a release and hence 
not shipping improvements to the user. I guess I'm mainly arguing from a 
project management point of view here by trying to keep the scope as small as 
possible and advocating for incremental steps.

>> 2. how to detect cyclic dependencies? Checking whether there are 
>> intra-region all-to-all blocking edges can be a performance efficient 
>> solution but is not the only choice, and it also requires attention to 
>> POINTWISE edges. If we can have a common way to find out cyclic dependencies 
>> in O(V^2), I think it's even better. This question can be answered later 
>> when we have a deeper look at all the options.

I agree. I think we need to take a look at possible algorithms. Maybe [Tarjan's 
strongly connected components 
algorithm|https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm]
 could be a good fit for the task.
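
To make the Tarjan idea a bit more concrete, here is a minimal sketch of how it could look. Everything in it is an assumption for illustration: {{RegionCycleDetector}} is a hypothetical helper and not an existing Flink class, regions are plain integer ids, and {{successors.get(v)}} lists the regions that consume blocking output of region {{v}}. A cyclic dependency exists if some strongly connected component contains more than one region, or a region has an edge to itself.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper (not part of Flink): Tarjan's SCC search over a region graph. */
final class RegionCycleDetector {
    private final List<List<Integer>> successors; // adjacency list, one entry per region
    private final int[] index;                    // discovery order, -1 means unvisited
    private final int[] lowLink;                  // smallest index reachable from the subtree
    private final boolean[] onStack;
    private final Deque<Integer> stack = new ArrayDeque<>();
    private final List<List<Integer>> components = new ArrayList<>();
    private int nextIndex = 0;

    private RegionCycleDetector(List<List<Integer>> successors) {
        this.successors = successors;
        int n = successors.size();
        index = new int[n];
        lowLink = new int[n];
        onStack = new boolean[n];
        Arrays.fill(index, -1);
    }

    /** Returns all strongly connected components of the graph. */
    static List<List<Integer>> stronglyConnectedComponents(List<List<Integer>> successors) {
        RegionCycleDetector detector = new RegionCycleDetector(successors);
        for (int v = 0; v < successors.size(); v++) {
            if (detector.index[v] == -1) {
                detector.visit(v);
            }
        }
        return detector.components;
    }

    /** True if any component has more than one region or a region depends on itself. */
    static boolean hasCyclicDependency(List<List<Integer>> successors) {
        for (List<Integer> component : stronglyConnectedComponents(successors)) {
            if (component.size() > 1) {
                return true;
            }
            int v = component.get(0);
            if (successors.get(v).contains(v)) {
                return true;
            }
        }
        return false;
    }

    // Recursive depth-first visit; a very deep graph would need an iterative variant.
    private void visit(int v) {
        index[v] = lowLink[v] = nextIndex++;
        stack.push(v);
        onStack[v] = true;
        for (int w : successors.get(v)) {
            if (index[w] == -1) {
                visit(w);
                lowLink[v] = Math.min(lowLink[v], lowLink[w]);
            } else if (onStack[w]) {
                lowLink[v] = Math.min(lowLink[v], index[w]);
            }
        }
        if (lowLink[v] == index[v]) {
            // v is the root of a component: pop everything above it, including v.
            List<Integer> component = new ArrayList<>();
            int w;
            do {
                w = stack.pop();
                onStack[w] = false;
                component.add(w);
            } while (w != v);
            components.add(component);
        }
    }
}
```

The search visits every region and every edge once, so it runs in O(V + E), which stays within the O(V^2) bound mentioned in the quote.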


> Avoid scheduling deadlocks caused by cyclic input dependencies between regions
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-17330
>                 URL: https://issues.apache.org/jira/browse/FLINK-17330
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.0
>            Reporter: Zhu Zhu
>            Priority: Major
>             Fix For: 1.11.0
>
>
> Imagine a job like this:
> A -- (pipelined FORWARD) --> B -- (blocking ALL-to-ALL) --> D
> A -- (pipelined FORWARD) --> C -- (pipelined FORWARD) --> D
> parallelism=2 for all vertices.
> We will have 2 execution pipelined regions:
> R1 = {A1, B1, C1, D1}
> R2 = {A2, B2, C2, D2}
> R1 has a cross-region input edge (B2->D1).
> R2 has a cross-region input edge (B1->D2).
> A scheduling deadlock will happen since we schedule a region only when all 
> its inputs are consumable (i.e. the blocking partitions it consumes have 
> finished): R1 can be scheduled only if R2 finishes, while R2 can be 
> scheduled only if R1 finishes.
> To avoid this, one solution is to force a logical pipelined region with 
> intra-region ALL-to-ALL blocking edges to form only one execution pipelined 
> region, so that there would not be cyclic input dependencies between regions.
> Besides that, we should also take care to avoid cyclic cross-region 
> POINTWISE blocking edges. 
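
The example job from the description can be reproduced with a small standalone model. This is only a sketch under stated assumptions: {{DeadlockExample}} and its methods are hypothetical names, not Flink code. Subtasks are plain strings such as "A1", regions are built by merging subtasks connected through pipelined edges, and each region is identified by an arbitrary representative subtask.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical model of the example job (not Flink code). */
final class DeadlockExample {
    // Union-find parent pointers; subtasks joined by pipelined edges share a root.
    private final Map<String, String> parent = new HashMap<>();

    private String find(String v) {
        parent.putIfAbsent(v, v);
        String p = parent.get(v);
        if (!p.equals(v)) {
            p = find(p);
            parent.put(v, p); // path compression
        }
        return p;
    }

    private void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    /** Maps each region representative to the other regions whose blocking output it consumes. */
    static Map<String, Set<String>> crossRegionInputs(int parallelism) {
        DeadlockExample graph = new DeadlockExample();
        for (int i = 1; i <= parallelism; i++) {
            // Pipelined FORWARD edges: A -> B, A -> C, C -> D.
            graph.union("A" + i, "B" + i);
            graph.union("A" + i, "C" + i);
            graph.union("C" + i, "D" + i);
        }
        Map<String, Set<String>> inputs = new TreeMap<>();
        for (int i = 1; i <= parallelism; i++) {
            for (int j = 1; j <= parallelism; j++) {
                // Blocking ALL-to-ALL edge: every B subtask feeds every D subtask.
                String producer = graph.find("B" + i);
                String consumer = graph.find("D" + j);
                if (!producer.equals(consumer)) {
                    inputs.computeIfAbsent(consumer, k -> new TreeSet<>()).add(producer);
                }
            }
        }
        return inputs;
    }

    /** Detects the two-region case only: X waits for Y while Y waits for X. */
    static boolean hasMutualWait(Map<String, Set<String>> inputs) {
        for (Map.Entry<String, Set<String>> entry : inputs.entrySet()) {
            for (String producer : entry.getValue()) {
                Set<String> producerInputs =
                        inputs.getOrDefault(producer, Collections.emptySet());
                if (producerInputs.contains(entry.getKey())) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With parallelism=2 the model yields two regions that each wait on the other's blocking output, which is the deadlock described above. With parallelism=1 the blocking edge stays inside the single region and there is no cross-region input at all. Note that {{hasMutualWait}} only catches cycles of length two; longer cycles would need a general graph search.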



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
