[jira] [Updated] (SPARK-7308) Should there be multiple concurrent attempts for one stage?

2015-11-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-7308:
-
Assignee: Davies Liu

> Should there be multiple concurrent attempts for one stage?
> ---
>
> Key: SPARK-7308
> URL: https://issues.apache.org/jira/browse/SPARK-7308
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1
>Reporter: Imran Rashid
>Assignee: Davies Liu
> Fix For: 1.5.3, 1.6.0
>
> Attachments: SPARK-7308_discussion.pdf
>
>
> Currently, when there is a fetch failure, you can end up with multiple 
> concurrent attempts for the same stage.  Is this intended?  At best, it leads 
> to some very confusing behavior, and it makes it hard for the user to make 
> sense of what is going on.  At worst, I think this is the cause of some very 
> strange errors we've seen from users, where stages start 
> executing before all the dependent stages have completed.
> This can happen in the following scenario:  there is a fetch failure in 
> attempt 0, so the stage is retried.  attempt 1 starts.  But, tasks from 
> attempt 0 are still running -- some of them can also hit fetch failures after 
> attempt 1 starts.  That will cause additional stage attempts to get fired up.
> There is an attempt to handle this already 
> https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105
> but that only checks whether the **stage** is running.  It really should 
> check whether that **attempt** is still running, but there isn't enough info 
> to do that.  
> I'll also post some info on how to reproduce this.
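
To make the attempt-level check described above concrete, here is a minimal, self-contained Scala sketch. All names (FetchFailure, StageState, latestAttemptId, AttemptAwareScheduler) are hypothetical and this is not the actual DAGScheduler code; it only illustrates resubmitting a stage when a fetch failure comes from the current attempt and ignoring failures reported by stale "zombie" attempts. The small demo replays the scenario from the description: attempt 0 fails, attempt 1 starts, and a late failure from attempt 0 arrives afterwards.

    import scala.collection.mutable

    // Hypothetical sketch only -- not Spark's DAGScheduler. A fetch failure
    // carries the attempt that produced it, so the scheduler can tell a current
    // failure apart from one reported by an already superseded attempt.
    case class FetchFailure(stageId: Int, stageAttemptId: Int)

    class StageState(val stageId: Int) {
      var latestAttemptId: Int = 0   // bumped every time the stage is resubmitted
    }

    class AttemptAwareScheduler {
      private val stages = mutable.Map[Int, StageState]()

      def submitStage(stageId: Int): Unit =
        stages.getOrElseUpdate(stageId, new StageState(stageId))

      def handleFetchFailure(f: FetchFailure): Unit = stages.get(f.stageId) match {
        case Some(stage) if f.stageAttemptId == stage.latestAttemptId =>
          // Failure from the current attempt: resubmit the stage exactly once.
          stage.latestAttemptId += 1
          println(s"stage ${f.stageId}: resubmitting as attempt ${stage.latestAttemptId}")
        case Some(stage) =>
          // Failure from a stale attempt that has already been replaced: ignore
          // it instead of firing up yet another concurrent attempt.
          println(s"stage ${f.stageId}: ignoring failure from stale attempt " +
            s"${f.stageAttemptId} (current attempt is ${stage.latestAttemptId})")
        case None =>
          println(s"stage ${f.stageId}: unknown stage, ignoring")
      }
    }

    object Demo extends App {
      val sched = new AttemptAwareScheduler
      sched.submitStage(1)
      sched.handleFetchFailure(FetchFailure(1, 0)) // attempt 0 fails -> attempt 1 starts
      sched.handleFetchFailure(FetchFailure(1, 0)) // late failure from attempt 0 -> ignored
      sched.handleFetchFailure(FetchFailure(1, 1)) // failure from attempt 1 -> attempt 2
    }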



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7308) Should there be multiple concurrent attempts for one stage?

2015-05-21 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-7308:

Attachment: SPARK-7308_discussion.pdf

Problem description and proposed solution, with a discussion of some alternatives.

 Should there be multiple concurrent attempts for one stage?
 ---

 Key: SPARK-7308
 URL: https://issues.apache.org/jira/browse/SPARK-7308
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Imran Rashid
Assignee: Imran Rashid
 Attachments: SPARK-7308_discussion.pdf


 Currently, when there is a fetch failure, you can end up with multiple 
 concurrent attempts for the same stage.  Is this intended?  At best, it leads 
 to some very confusing behavior, and it makes it hard for the user to make 
 sense of what is going on.  At worst, I think this is the cause of some very 
 strange errors we've seen from users, where stages start 
 executing before all the dependent stages have completed.
 This can happen in the following scenario:  there is a fetch failure in 
 attempt 0, so the stage is retried.  attempt 1 starts.  But, tasks from 
 attempt 0 are still running -- some of them can also hit fetch failures after 
 attempt 1 starts.  That will cause additional stage attempts to get fired up.
 There is an attempt to handle this already 
 https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105
 but that only checks whether the **stage** is running.  It really should 
 check whether that **attempt** is still running, but there isn't enough info 
 to do that.  
 I'll also post some info on how to reproduce this.






[jira] [Updated] (SPARK-7308) Should there be multiple concurrent attempts for one stage?

2015-05-01 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-7308:

Description: 
Currently, when there is a fetch failure, you can end up with multiple 
concurrent attempts for the same stage.  Is this intended?  At best, it leads 
to some very confusing behavior, and it makes it hard for the user to make 
sense of what is going on.  At worst, I think this is the cause of some very 
strange errors we've seen from users, where stages start 
executing before all the dependent stages have completed.

This can happen in the following scenario:  there is a fetch failure in attempt 
0, so the stage is retried.  attempt 1 starts.  But, tasks from attempt 0 are 
still running -- some of them can also hit fetch failures after attempt 1 
starts.  That will cause additional stage attempts to get fired up.

There is an attempt to handle this already 
https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105

but that only checks whether the **stage** is running.  It really should check 
whether that **attempt** is still running, but there isn't enough info to do 
that.

Given the release timeline, I'm going to submit a PR to just fail fast as soon 
as we detect there are multiple concurrent attempts.  I'd like some feedback 
from others on whether or not this is a good thing to do.  (The crazy thing is, 
when I reproduce this, Spark seems to actually do the right thing despite the 
multiple attempts at the same stage, but I feel like that is probably dumb luck 
from what I've been testing.)

I'll also post some info on how to reproduce this.  Finally, if there really 
shouldn't be multiple concurrent attempts, then we can open another ticket for 
the proper fix (as opposed to just failing fast) after the 1.4 release.
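
For the fail-fast stopgap mentioned above, the following Scala sketch shows the general idea under assumed names (FailFastTracker, onAttemptStart, onAttemptFinish); it is not an actual Spark patch. It simply tracks which attempts of each stage still have running tasks and aborts as soon as a second concurrent attempt of the same stage would start, rather than letting the attempts race.

    import scala.collection.mutable

    // Hypothetical sketch of the fail-fast stopgap described above.
    class FailFastTracker {
      // stageId -> attempt ids that still have running tasks
      private val runningAttempts = mutable.Map[Int, mutable.Set[Int]]()

      def onAttemptStart(stageId: Int, attemptId: Int): Unit = {
        val attempts = runningAttempts.getOrElseUpdate(stageId, mutable.Set[Int]())
        attempts += attemptId
        if (attempts.size > 1) {
          // Fail fast rather than allowing concurrent attempts of one stage to race.
          throw new IllegalStateException(
            s"stage $stageId has concurrent attempts ${attempts.toSeq.sorted.mkString(", ")}")
        }
      }

      def onAttemptFinish(stageId: Int, attemptId: Int): Unit =
        runningAttempts.get(stageId).foreach(_ -= attemptId)
    }

The proper fix sketched earlier (checking which attempt a failure belongs to) would replace this stopgap once the scheduler has enough information to track attempts.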

  was:
Currently, when there is a fetch failure, you can end up with multiple 
concurrent attempts for the same stage.  Is this intended?  At best, it leads 
to some very confusing behavior, and it makes it hard for the user to make 
sense of what is going on.  At worst, I think this is the cause of some very 
strange errors we've seen from users, where stages start 
executing before all the dependent stages have completed.

This can happen in the following scenario:  there is a fetch failure in attempt 
0, so the stage is retried.  attempt 1 starts.  But, tasks from attempt 0 are 
still running -- some of them can also hit fetch failures after attempt 1 
starts.  That will cause additional stage attempts to get fired up.

There is an attempt to handle this already 
https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105

but that only checks whether the **stage** is running.  It really should check 
whether that **attempt** is still running, but there isn't enough info to do 
that yet.

Given the release timeline, I'm going to submit a PR to just fail fast as soon 
as we detect there are multiple concurrent attempts.  I'd like some feedback 
from others on whether or not this is a good thing to do.  (The crazy thing is, 
when I reproduce this, Spark seems to actually do the right thing despite the 
multiple attempts at the same stage, but I feel like that is probably dumb luck 
from what I've been testing.)

I'll also post some info on how to reproduce this.  Finally, if there really 
shouldn't be multiple concurrent attempts, then we can open another ticket for 
the proper fix (as opposed to just failing fast) after the 1.4 release.


 Should there be multiple concurrent attempts for one stage?
 ---

 Key: SPARK-7308
 URL: https://issues.apache.org/jira/browse/SPARK-7308
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: Imran Rashid
Assignee: Imran Rashid

 Currently, when there is a fetch failure, you can end up with multiple 
 concurrent attempts for the same stage.  Is this intended?  At best, it leads 
 to some very confusing behavior, and it makes it hard for the user to make 
 sense of what is going on.  At worst, I think this is the cause of some very 
 strange errors we've seen from users, where stages start 
 executing before all the dependent stages have completed.
 This can happen in the following scenario:  there is a fetch failure in 
 attempt 0, so the stage is retried.  attempt 1 starts.  But, tasks from 
 attempt 0 are still running -- some of them can also hit fetch failures after 
 attempt 1 starts.  That will cause additional stage attempts to get fired up.
 There is an attempt to handle this already