[ https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725552#comment-14725552 ]

Imran Rashid commented on SPARK-2666:
-------------------------------------

I'm copying [~kayousterhout]'s comment from the PR here for discussion:

bq. My understanding is that it can help to let the remaining tasks run -- 
because they may hit fetch failures from different map outputs than the 
original fetch failure, which will lead the DAGScheduler to more quickly 
reschedule all of the failed tasks. For example, if an executor failed and had 
multiple map outputs on it, the first fetch failure will only tell us about one 
of the map outputs being missing, and it's helpful to learn about all of them 
before we resubmit the earlier stage. Did you already think about this / am I 
misunderstanding the issue?

Things may have changed in the meantime, but I'm pretty sure that now, when 
there is a fetch failure, Spark assumes it has lost *all* of the map output for 
that host.  It's a bit confusing -- we first only remove [the one map output 
with the 
failure|https://github.com/apache/spark/blob/391e6be0ae883f3ea0fab79463eb8b618af79afb/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1134], 
but then we remove all map outputs for that executor in 
[{{handleExecutorLost}}|https://github.com/apache/spark/blob/391e6be0ae883f3ea0fab79463eb8b618af79afb/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1184]. 
I suppose it could still be useful to run the remaining tasks, as they may 
discover *another* executor that has died, but I don't think it's worth it just 
for that, right?
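
To make that concrete, here is a tiny toy model of the map-output bookkeeping 
as I read it.  This is NOT the real DAGScheduler / MapOutputTracker code -- the 
method names just mirror the two linked lines -- it only illustrates why a 
single fetch failure already wipes out everything we knew about that host:

{code:scala}
import scala.collection.mutable

// Toy model of the bookkeeping described above, not Spark code.
object FetchFailureToy {
  // (shuffleId, mapId) -> executor currently holding that map output
  private val outputs = mutable.Map(
    (0, 0) -> "exec-1",
    (0, 1) -> "exec-1",
    (0, 2) -> "exec-2")

  // Analogue of the first linked line: drop just the one output that failed.
  def unregisterMapOutput(shuffleId: Int, mapId: Int): Unit =
    outputs.remove((shuffleId, mapId))

  // Analogue of handleExecutorLost: drop *every* output on that executor.
  def removeOutputsOnExecutor(execId: String): Unit = {
    val lost = outputs.collect { case (key, loc) if loc == execId => key }.toList
    outputs --= lost
  }

  def main(args: Array[String]): Unit = {
    // A reducer reports a FetchFailed for (shuffle 0, map 0) on exec-1 ...
    unregisterMapOutput(0, 0)
    // ... and the scheduler immediately treats exec-1 as lost as well, so
    // map 1 is forgotten too.
    removeOutputsOnExecutor("exec-1")
    println(outputs)  // only (0,2) -> exec-2 is left
  }
}
{code}

So by the time the first FetchFailed is handled, later fetch failures against 
the same host don't tell the scheduler anything new.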

Elsewhere we've also discussed always killing all tasks as soon as the 
{{TaskSetManager}} is marked as a zombie; see 
https://github.com/squito/spark/pull/4.

I'm particularly interested because this is relevant to SPARK-10370.  In that 
case, there wouldn't be any benefit to leaving tasks running after marking the 
stage as a zombie.  If we do want to cancel all tasks as soon as we mark a 
stage as a zombie, then I'd prefer we go the route of making {{isZombie}} 
private and making task cancellation part of {{markAsZombie}}, so the code is 
easier to follow and we are sure we always cancel the tasks.
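
Concretely, the shape I have in mind is something like the sketch below.  It is 
not a patch against the real {{TaskSetManager}}; {{killTask}} stands in for 
whatever hook the scheduler backend actually gives us, and the field names are 
placeholders:

{code:scala}
import scala.collection.mutable

// Sketch only.  The point is that flipping the zombie flag and cancelling the
// running tasks happen in one place, so they can never get out of sync.
class TaskSetManagerSketch(killTask: Long => Unit) {
  private var isZombie = false                        // no longer settable from outside
  private val runningTasks = mutable.Set.empty[Long]

  def addRunningTask(tid: Long): Unit = runningTasks += tid

  /** The single way to become a zombie: mark the flag *and* cancel everything. */
  def markAsZombie(): Unit = {
    if (!isZombie) {
      isZombie = true
      runningTasks.foreach(killTask)
      runningTasks.clear()
    }
  }
}
{code}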

Is my understanding correct?  Other opinions on the right approach here?

> when task is FetchFailed cancel running tasks of failedStage
> ------------------------------------------------------------
>
>                 Key: SPARK-2666
>                 URL: https://issues.apache.org/jira/browse/SPARK-2666
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Lianhui Wang
>
> in DAGScheduler's handleTaskCompletion, when the reason a task failed is 
> FetchFailed, cancel the running tasks of failedStage before adding failedStage 
> to the failedStages queue.


