[ https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14531969#comment-14531969 ]

Apache Spark commented on SPARK-7308:
-------------------------------------

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/5964

> Should there be multiple concurrent attempts for one stage?
> -----------------------------------------------------------
>
>                 Key: SPARK-7308
>                 URL: https://issues.apache.org/jira/browse/SPARK-7308
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.1
>            Reporter: Imran Rashid
>            Assignee: Imran Rashid
>
> Currently, when there is a fetch failure, you can end up with multiple 
> concurrent attempts for the same stage.  Is this intended?  At best, it leads 
> to some very confusing behavior, and it makes it hard for the user to make 
> sense of what is going on.  At worst, I think this is the cause of some very 
> strange errors we've seen from users, where stages start executing before all 
> of the stages they depend on have completed.
> This can happen in the following scenario: there is a fetch failure in 
> attempt 0, so the stage is retried and attempt 1 starts.  But tasks from 
> attempt 0 are still running, and some of them can also hit fetch failures 
> after attempt 1 starts.  Each of those failures causes yet another stage 
> attempt to be fired up.
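> As a toy model of this race (all names here are invented, not real Spark 
> code; the real logic lives in DAGScheduler), note that the resubmission path 
> never asks which attempt a failure came from:
>
>     object ConcurrentAttemptRace {
>       case class Attempt(stageId: Int, attemptId: Int)
>
>       class Scheduler {
>         private var nextAttemptId = 0
>         var running = List.empty[Attempt]
>
>         def submitStage(stageId: Int): Attempt = {
>           val a = Attempt(stageId, nextAttemptId)
>           nextAttemptId += 1
>           running ::= a
>           a
>         }
>
>         // A fetch failure resubmits unconditionally; nothing checks whether
>         // the failing attempt has already been superseded.
>         def onFetchFailure(from: Attempt): Unit = { submitStage(from.stageId) }
>       }
>
>       def main(args: Array[String]): Unit = {
>         val sched = new Scheduler
>         val attempt0 = sched.submitStage(stageId = 3)
>         sched.onFetchFailure(attempt0) // attempt 1 starts; attempt 0 still running
>         sched.onFetchFailure(attempt0) // straggler from attempt 0 fails again
>         println(sched.running.size)    // 3 concurrent attempts of stage 3
>       }
>     }
>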
> There is an attempt to handle this already: 
> https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105
> but that check only asks whether the **stage** is running.  It really should 
> ask whether that particular **attempt** is still running, but there isn't 
> enough info tracked to do that.
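> Paraphrasing, the guard today is a stage-level check, while what's needed is 
> an attempt-level check.  In this sketch, stageAttemptId on the task and 
> latestAttemptId on the stage are hypothetical -- exactly the info that isn't 
> tracked today:
>
>     case class StageInfo(id: Int, latestAttemptId: Int)
>     case class TaskInfo(stageId: Int, stageAttemptId: Int)
>
>     def shouldHandleFetchFailure(runningStages: Set[Int],
>                                  failedStage: StageInfo,
>                                  task: TaskInfo): Boolean = {
>       // Today: only asks whether *some* attempt of the stage is running, so
>       // a late failure from a superseded attempt still triggers a retry.
>       val stageIsRunning = runningStages.contains(failedStage.id)
>
>       // Needed: is the failing task from the attempt that is running now?
>       stageIsRunning && task.stageAttemptId == failedStage.latestAttemptId
>     }
>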
> Given the release timeline, I'm going to submit a PR to just fail fast as 
> soon as we detect multiple concurrent attempts.  I'd like some feedback from 
> others on whether or not this is a good thing to do.  (The crazy thing is, 
> when I reproduce this, Spark seems to actually do the right thing despite the 
> multiple attempts at the same stage, but I suspect that is just dumb luck 
> given what I've been testing.)
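> The fail-fast check itself can be small; something along these lines (again 
> a sketch with invented types, not the actual PR):
>
>     case class TaskSet(stageId: Int, attemptId: Int, isZombie: Boolean)
>
>     def submitTasks(taskSet: TaskSet, active: Seq[TaskSet]): Unit = {
>       // If another attempt of this stage still has live tasks, fail loudly
>       // instead of letting two attempts race.
>       val conflicting = active.exists { ts =>
>         ts.stageId == taskSet.stageId && !ts.isZombie
>       }
>       if (conflicting) {
>         throw new IllegalStateException(
>           s"more than one active task set for stage ${taskSet.stageId}")
>       }
>       // ... otherwise hand the task set to the backend as usual
>     }
>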
> I'll also post some info on how to reproduce this.  Finally, if there really 
> shouldn't be multiple concurrent attempts, then we can open another ticket 
> for the proper fix (as opposed to just failing fast) after the 1.4 release.


