[ 
https://issues.apache.org/jira/browse/FLINK-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407866#comment-15407866
 ] 

Stephan Ewen commented on FLINK-4256:
-------------------------------------

[~wenlong.lwl] True, one has to find the entire "connected component" for 
restart. That one is, however, dynamic, so I would not pre-compute it:
  - We may introduce best-effort caching that means in some cases the program 
must backtrack further, in others less
  - Downstream canceling is only necessary if an input has already been 
supplied to the downstream task. Especially in batch, this is often not the 
case and can reduce the tasks to look at for canceling.

We can make this quite a bit more efficient in my opinion by operating on 
{{ExecutionJobVertex}} level for many cases, rather than on each individual 
vertex.

> Fine-grained recovery
> ---------------------
>
>                 Key: FLINK-4256
>                 URL: https://issues.apache.org/jira/browse/FLINK-4256
>             Project: Flink
>          Issue Type: Improvement
>          Components: JobManager
>    Affects Versions: 1.1.0
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>             Fix For: 1.2.0
>
>
> When a task fails during execution, Flink currently resets the entire 
> execution graph and triggers complete re-execution from the last completed 
> checkpoint. This is more expensive than just re-executing the failed tasks.
> In many cases, more fine-grained recovery is possible.
> The full description and design is in the corresponding FLIP.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1+%3A+Fine+Grained+Recovery+from+Task+Failures



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to