[ https://issues.apache.org/jira/browse/SPARK-14649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925125#comment-15925125 ]
Apache Spark commented on SPARK-14649:
--------------------------------------

User 'sitalkedia' has created a pull request for this issue:

https://github.com/apache/spark/pull/17297

> DagScheduler re-starts all running tasks on fetch failure
> ----------------------------------------------------------
>
>                 Key: SPARK-14649
>                 URL: https://issues.apache.org/jira/browse/SPARK-14649
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>            Reporter: Sital Kedia
>
> When a fetch failure occurs, the DAGScheduler re-launches the previous stage
> (to re-generate the output that was lost), and then re-launches all tasks in
> the stage with the fetch failure that hadn't *completed* when the fetch
> failure occurred (the DAGScheduler re-launches all of the tasks whose output
> data is not available -- which is equivalent to the set of tasks that hadn't
> yet completed).
> The assumption when this code was originally written was that when a fetch
> failure occurred, the output from at least one of the tasks in the previous
> stage was no longer available, so all of the tasks in the current stage would
> eventually fail because they could not access that output. This assumption
> does not hold for some large-scale, long-running workloads. E.g., there is
> one use case where a job has ~100k tasks that each run for about 1 hour, and
> only the first 5-10 minutes are spent fetching data. Because of the large
> number of tasks, it's very common to see a few tasks fail in the fetch phase,
> and it's wasteful to re-run other tasks that had already finished fetching
> data and so aren't affected by the fetch failure (and may be most of the way
> through their hour-long execution). The DAGScheduler should not re-start
> these tasks.
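To make the behaviour described above concrete, here is a minimal, self-contained Scala sketch. The Stage class, the partitionsToResubmit helper, and all the numbers in it are illustrative assumptions, not Spark's actual DAGScheduler internals; it only models how resubmitting every uncompleted partition of the reduce stage throws away work from tasks that never touched the lost map output.

{code:scala}
// Hypothetical, simplified model of the behaviour described in this issue.
// None of these names correspond to real Spark scheduler classes.
object FetchFailureSketch {

  // Illustrative stage: a fixed number of partitions, some already completed.
  case class Stage(id: Int, numPartitions: Int, completed: Set[Int]) {
    // Partitions whose output has not been registered yet.
    def missingPartitions: Seq[Int] =
      (0 until numPartitions).filterNot(completed.contains)
  }

  // Roughly the current behaviour: on a fetch failure,
  //  1. resubmit the map partition whose output was lost, and
  //  2. resubmit EVERY reduce partition that has not completed, including
  //     tasks that already finished fetching and are mid-computation.
  def partitionsToResubmit(
      mapStage: Stage,
      reduceStage: Stage,
      lostMapPartition: Int): (Seq[Int], Seq[Int]) = {
    val mapPartitions    = Seq(lostMapPartition)          // regenerate lost output
    val reducePartitions = reduceStage.missingPartitions  // ALL unfinished reducers
    (mapPartitions, reducePartitions)
  }

  def main(args: Array[String]): Unit = {
    val mapStage    = Stage(id = 0, numPartitions = 4, completed = Set(0, 1, 2, 3))
    val reduceStage = Stage(id = 1, numPartitions = 6, completed = Set(0, 1))

    // Reduce partition 2 failed fetching map output 3; partitions 3-5 are
    // still running and were not affected by the lost output.
    val (maps, reduces) =
      partitionsToResubmit(mapStage, reduceStage, lostMapPartition = 3)

    println(s"map partitions to re-run:    $maps")    // List(3)
    println(s"reduce partitions to re-run: $reduces") // 2, 3, 4, 5 (wasteful)
  }
}
{code}

Under this toy model, a single lost map output causes reduce partitions 2 through 5 to be resubmitted even though only partition 2 actually hit the fetch failure; avoiding that wasted work for the still-running tasks is what this issue (and the linked pull request) asks for.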