[ 
https://issues.apache.org/jira/browse/FLINK-10644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760004#comment-16760004
 ] 

Greg Hogan commented on FLINK-10644:
------------------------------------

I'm not so sure that speculative execution is a good fit for Apache Flink. In 
MapReduce there are two concepts conducive to speculative execution not present 
in Flink:

1) In MapReduce, the ratio of map/reduce tasks to mapper/reducer containers is 
often 10:1 or higher. In Flink, all tasks are immediately assigned and 
processed in parallel.

2) In MapReduce intermediate and output data is always persisted whereas in 
Flink only state is persisted (and only in streaming). Input is assumed to be 
replayable but speculative execution would presumably also work for 
intermediate tasks.

 

As noted, Spark includes speculative execution, and Spark's processing 
model is closer to Flink's. I'm just not clear on the circumstances in which it 
is beneficial to start a catch-up task so late. I haven't followed the work on 
unifying batch and streaming, but it seems more valuable to focus on 
transitioning a task off a straggler machine than on starting that task over.

> Batch Job: Speculative execution
> --------------------------------
>
>                 Key: FLINK-10644
>                 URL: https://issues.apache.org/jira/browse/FLINK-10644
>             Project: Flink
>          Issue Type: New Feature
>          Components: JobManager
>            Reporter: JIN SUN
>            Assignee: ryantaocer
>            Priority: Major
>             Fix For: 1.8.0
>
>
> Stragglers/outliers are tasks that run slower than most of the other tasks in 
> a batch job. They increase job latency, because a straggler is very likely to 
> be on the critical path of the job and become the bottleneck.
> Tasks may be slow for various reasons, including hardware degradation, 
> software misconfiguration, or noisy neighbors. It is hard for the JM to 
> predict the runtime.
> To reduce the overhead of stragglers, other systems such as Hadoop/Tez and 
> Spark have *_speculative execution_*. Speculative execution is a health-check 
> procedure that checks for tasks to be speculated, i.e. tasks running slower in 
> an ExecutionJobVertex than the median of all successfully completed tasks in 
> that EJV. Such slow tasks will be re-submitted to another TM. It will not 
> stop the slow tasks, but will run a new copy in parallel, and will kill the 
> others once one of them completes.
> This JIRA is an umbrella issue for applying this idea in Flink. Details will 
> be appended later.
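
For illustration only, the median check in the quoted description could be 
sketched roughly as below. StragglerDetector, findStragglers and 
slowdownFactor are hypothetical names, not part of Flink or of the proposal; 
the description only says "slower than the median", so the multiplier is an 
added assumption. The sketch covers detection only, not re-submission to 
another TM or cancelling the losing attempts.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for illustration; not a Flink class. */
final class StragglerDetector {

    /** Median duration (ms) of the tasks that already completed successfully. */
    static double median(List<Long> finishedDurationsMs) {
        if (finishedDurationsMs.isEmpty()) {
            throw new IllegalArgumentException("no finished tasks");
        }
        List<Long> sorted = new ArrayList<>(finishedDurationsMs);
        Collections.sort(sorted);
        int n = sorted.size();
        if (n % 2 == 1) {
            return sorted.get(n / 2);
        }
        return (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    /**
     * Returns the indices of running tasks whose elapsed time exceeds
     * slowdownFactor * median of the finished tasks in the same vertex.
     * slowdownFactor is an assumption added for this sketch.
     */
    static List<Integer> findStragglers(List<Long> finishedDurationsMs,
                                        List<Long> runningElapsedMs,
                                        double slowdownFactor) {
        List<Integer> stragglers = new ArrayList<>();
        if (finishedDurationsMs.isEmpty()) {
            return stragglers; // nothing to compare against yet
        }
        double threshold = slowdownFactor * median(finishedDurationsMs);
        for (int i = 0; i < runningElapsedMs.size(); i++) {
            if (runningElapsedMs.get(i) > threshold) {
                stragglers.add(i);
            }
        }
        return stragglers;
    }
}
```

Note that this check needs some tasks of the vertex to have finished before 
it can flag anything, so a speculative copy is by construction started late 
in the vertex's runtime.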



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
