[ https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523783#comment-14523783 ]
Apache Spark commented on SPARK-7308: ------------------------------------- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/5844 > Should there be multiple concurrent attempts for one stage? > ----------------------------------------------------------- > > Key: SPARK-7308 > URL: https://issues.apache.org/jira/browse/SPARK-7308 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.3.1 > Reporter: Imran Rashid > Assignee: Imran Rashid > > Currently, when there is a fetch failure, you can end up with multiple > concurrent attempts for the same stage. Is this intended? At best, it leads > to some very confusing behavior, and it makes it hard for the user to make > sense of what is going on. At worst, I think this is cause of some very > strange errors we've seen errors we've seen from users, where stages start > executing before all the dependent stages have completed. > This can happen in the following scenario: there is a fetch failure in > attempt 0, so the stage is retried. attempt 1 starts. But, tasks from > attempt 0 are still running -- some of them can also hit fetch failures after > attempt 1 starts. That will cause additional stage attempts to get fired up. > There is an attempt to handle this already > https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105 > but that only checks whether the **stage** is running. It really should > check whether that **attempt** is still running, but there isn't enough info > to do that. > Given the release timeline, I'm going to submit a PR to just fail fast as > soon as we detect there are multiple concurrent attempts. Would like some > feedback from others on whether or not this is a good thing to do. (The > crazy thing is, when I reproduce this, spark seems to actually do the right > thing despite the multiple attempts at the same stage, but I feel like that > is probably dumb luck from what I've been testing.) > I'll also post some info on how to reproduce this. Finally, if there really > shouldn't be multiple concurrent attempts, then we can open another ticket > for the proper fix (as opposed to just failiing fast) after the 1.4 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org