[jira] [Updated] (FLINK-8732) Cancel scheduling operation when cancelling the ExecutionGraph

Till Rohrmann (JIRA) Wed, 21 Feb 2018 08:06:26 -0800

     [ 
https://issues.apache.org/jira/browse/FLINK-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Till Rohrmann updated FLINK-8732:
---------------------------------
    Description: 
With the Flip-6 changes and the support for queued scheduling, the 
{{ExecutionGraph}} must be able to handle cancellation calls when it is not yet 
fully scheduled. This is for example the case when waiting for new containers.

A cancellation will cancel all {{Executions}}. While this happens, slots that 
become available can still get assigned to {{Executions}} which are already 
canceled. Since a slot cannot be assigned to a canceled {{Execution}}, the 
assignment fails, and this can fail the overall eager scheduling operation. 
The scheduling result callback will then trigger a global fail operation. This 
can happen before all {{Executions}} have been released and, thus, while the 
{{ExecutionGraph}} is still in the state {{CANCELLING}}. The result is that 
the {{ExecutionGraph}} goes into the state {{FAILING}} and then {{FAILED}} 
instead of completing the cancellation.

In order to solve this problem, I propose to keep track of the scheduling 
operation and to cancel its result future when a concurrent {{suspend}}, 
{{cancel}} or {{fail}} call happens.
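
A minimal sketch of the proposed approach, not the actual patch. It assumes 
the scheduling result is exposed as a {{CompletableFuture}}. The class 
MiniExecutionGraph, its reduced JobStatus enum and all method names are 
hypothetical stand-ins. Only {{cancel}} and {{fail}} are shown; {{suspend}} 
would cancel the tracked future in the same way.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, heavily simplified stand-in for the ExecutionGraph (not Flink's code).
class MiniExecutionGraph {

    enum JobStatus { CREATED, RUNNING, CANCELLING, FAILING }

    private final AtomicReference<JobStatus> state =
            new AtomicReference<>(JobStatus.CREATED);

    // Handle on the in-flight scheduling operation; null until scheduling starts.
    private volatile CompletableFuture<Void> schedulingFuture;

    // Cause of the last global failure, kept only for inspection.
    private volatile Throwable failureCause;

    // Assumption: 'allSlotsAssigned' completes once every Execution has a slot
    // and completes exceptionally if any slot assignment fails.
    void scheduleEager(CompletableFuture<Void> allSlotsAssigned) {
        state.set(JobStatus.RUNNING);
        schedulingFuture = allSlotsAssigned;
        allSlotsAssigned.whenComplete((ignored, failure) -> {
            // A cancelled scheduling operation is not a job failure.
            if (failure != null && !allSlotsAssigned.isCancelled()) {
                failGlobal(failure);
            }
        });
    }

    void cancel() {
        // Cancel the scheduling result first, so that a late slot assignment
        // failure can no longer complete the future and trigger failGlobal.
        cancelOngoingScheduling();
        state.compareAndSet(JobStatus.RUNNING, JobStatus.CANCELLING);
    }

    void failGlobal(Throwable cause) {
        cancelOngoingScheduling();
        failureCause = cause;
        state.set(JobStatus.FAILING);
    }

    private void cancelOngoingScheduling() {
        CompletableFuture<Void> ongoing = schedulingFuture;
        if (ongoing != null) {
            ongoing.cancel(false); // no-op if the future is already complete
        }
    }

    JobStatus getState() {
        return state.get();
    }

    Throwable getFailureCause() {
        return failureCause;
    }

    public static void main(String[] args) {
        MiniExecutionGraph graph = new MiniExecutionGraph();
        CompletableFuture<Void> scheduling = new CompletableFuture<>();
        graph.scheduleEager(scheduling);

        graph.cancel();
        // A slot is offered to an already canceled Execution; the resulting
        // failure arrives after the scheduling future has been cancelled.
        scheduling.completeExceptionally(
                new IllegalStateException("Execution already canceled"));

        System.out.println(graph.getState()); // prints CANCELLING
    }
}
```

Because the future is already completed by the cancellation, the late 
completeExceptionally call has no effect and the graph stays in 
{{CANCELLING}}. The callback still has to ignore the cancellation itself, 
since cancelling a {{CompletableFuture}} also completes it exceptionally.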

> Cancel scheduling operation when cancelling the ExecutionGraph
> --------------------------------------------------------------
>
>                 Key: FLINK-8732
>                 URL: https://issues.apache.org/jira/browse/FLINK-8732
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Major
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> With the Flip-6 changes and the support for queued scheduling, the 
> {{ExecutionGraph}} must be able to handle cancellation calls when it is not 
> yet fully scheduled. This is for example the case when waiting for new 
> containers.
> A cancellation will cancel all {{Executions}}. While this happens, slots 
> that become available can still get assigned to {{Executions}} which are 
> already canceled. Since a slot cannot be assigned to a canceled 
> {{Execution}}, the assignment fails, and this can fail the overall eager 
> scheduling operation. The scheduling result callback will then trigger a 
> global fail operation. This can happen before all {{Executions}} have been 
> released and, thus, while the {{ExecutionGraph}} is still in the state 
> {{CANCELLING}}. The result is that the {{ExecutionGraph}} goes into the 
> state {{FAILING}} and then {{FAILED}} instead of completing the cancellation.
> In order to solve this problem, I propose to keep track of the scheduling 
> operation and to cancel its result future when a concurrent {{suspend}}, 
> {{cancel}} or {{fail}} call happens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
