[
https://issues.apache.org/jira/browse/SPARK-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Imran Rashid updated SPARK-7308:
Description:
Currently, when there is a fetch failure, you can end up with multiple
concurrent attempts for the same stage. Is this intended? At best, it leads
to some very confusing behavior, and it makes it hard for the user to make
sense of what is going on. At worst, I think this is the cause of some very
strange errors we've seen from users, where stages start executing before all
of the stages they depend on have completed.
This can happen in the following scenario: there is a fetch failure in attempt
0, so the stage is retried and attempt 1 starts. But tasks from attempt 0 are
still running -- some of them can also hit fetch failures after attempt 1
starts. That will cause additional stage attempts to get fired up.
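Here is a minimal sketch of the race as a toy model in plain Scala -- none of
these names (latestAttempt, runningAttempts, onFetchFailure) are actual
DAGScheduler code; they just model the bookkeeping:
{code:scala}
// Toy model of the race -- not DAGScheduler code; all names are hypothetical.
object ConcurrentAttemptRace {
  // stage id -> most recently started attempt id
  var latestAttempt = Map(1 -> 0)
  // stage id -> attempt ids that still have running tasks
  var runningAttempts = Map(1 -> Set(0))

  // When a fetch failure for a stage arrives, the scheduler only knows the
  // *stage* failed, so it starts a new attempt regardless of which attempt
  // the failing task belonged to.
  def onFetchFailure(stageId: Int, fromAttempt: Int): Unit = {
    val next = latestAttempt(stageId) + 1
    latestAttempt += stageId -> next
    runningAttempts += stageId -> (runningAttempts(stageId) + next)
    println(s"fetch failure from attempt $fromAttempt of stage $stageId; " +
      s"running attempts now: ${runningAttempts(stageId).toList.sorted}")
  }

  def main(args: Array[String]): Unit = {
    onFetchFailure(1, 0) // attempt 0 fails -> attempt 1 starts
    onFetchFailure(1, 0) // straggler task from attempt 0 fails again
                         // -> attempt 2 starts while attempt 1 is running
  }
}
{code}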
There is already an attempt to handle this:
https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105
but that check only asks whether the **stage** is still running. It really
needs to ask whether that **attempt** is still running, and there isn't enough
information tracked to do that.
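To make the gap concrete, here is a hedged sketch of the two checks side by
side; runningStages and the per-task attempt id are illustrative stand-ins,
not the real DAGScheduler fields:
{code:scala}
// Illustrative only -- these parameters stand in for scheduler state; they
// are not the actual DAGScheduler API.
def shouldHandleFetchFailure(
    failedStageId: Int,
    taskAttemptId: Int,              // attempt the failing task came from
    runningStages: Set[Int],        // what the current check looks at
    latestAttemptId: Map[Int, Int]  // what the check would need to look at
  ): Boolean = {
  // Current guard: "is the stage running?" -- this passes even when the
  // failure came from a stale attempt, so another attempt gets launched.
  val stageIsRunning = runningStages.contains(failedStageId)

  // Needed guard: "did this failure come from the attempt that is currently
  // running?" This requires tracking each task's stage attempt id.
  stageIsRunning && taskAttemptId == latestAttemptId(failedStageId)
}
{code}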
Given the release timeline, I'm going to submit a PR to just fail fast as soon
as we detect multiple concurrent attempts. I'd like some feedback from others
on whether or not this is a good thing to do. (The crazy thing is, when I
reproduce this, Spark actually seems to do the right thing despite the multiple
attempts at the same stage, but I feel like that is probably dumb luck given
what I've been testing.)
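A fail-fast check could look roughly like the sketch below. This assumes the
scheduler tracks running attempt ids per stage (which it does not today); in
Spark proper this would presumably be a SparkException, but the sketch throws
IllegalStateException to stay self-contained:
{code:scala}
// Sketch of the proposed fail-fast behavior with hypothetical bookkeeping:
// abort as soon as a second concurrent attempt for the same stage is
// detected, instead of letting the attempts race.
def registerAttempt(
    runningAttempts: Map[Int, Set[Int]],
    stageId: Int,
    newAttemptId: Int): Map[Int, Set[Int]] = {
  val attempts = runningAttempts.getOrElse(stageId, Set.empty[Int]) + newAttemptId
  if (attempts.size > 1) {
    throw new IllegalStateException(
      s"Stage $stageId has multiple concurrent attempts: ${attempts.toList.sorted}")
  }
  runningAttempts + (stageId -> attempts)
}
{code}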
I'll also post some info on how to reproduce this. Finally, if there really
shouldn't be multiple concurrent attempts, then we can open another ticket for
the proper fix (as opposed to just failing fast) after the 1.4 release.
was:
Currently, when there is a fetch failure, you can end up with multiple
concurrent attempts for the same stage. Is this intended? At best, it leads
to some very confusing behavior, and it makes it hard for the user to make
sense of what is going on. At worst, I think this is the cause of some very
strange errors we've seen from users, where stages start executing before all
of the stages they depend on have completed.
This can happen in the following scenario: there is a fetch failure in attempt
0, so the stage is retried and attempt 1 starts. But tasks from attempt 0 are
still running -- some of them can also hit fetch failures after attempt 1
starts. That will cause additional stage attempts to get fired up.
There is already an attempt to handle this:
https://github.com/apache/spark/blob/16860327286bc08b4e2283d51b4c8fe024ba5006/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1105
but that check only asks whether the **stage** is still running. It really
needs to ask whether that **attempt** is still running, and there isn't enough
information tracked to do that yet.
Given the release timeline, I'm going to submit a PR to just fail fast as soon
as we detect multiple concurrent attempts. I'd like some feedback from others
on whether or not this is a good thing to do. (The crazy thing is, when I
reproduce this, Spark actually seems to do the right thing despite the multiple
attempts at the same stage, but I feel like that is probably dumb luck given
what I've been testing.)
I'll also post some info on how to reproduce this. Finally, if there really
shouldn't be multiple concurrent attempts, then we can open another ticket for
the proper fix (as opposed to just failing fast) after the 1.4 release.
Should there be multiple concurrent attempts for one stage?
---
Key: SPARK-7308
URL: https://issues.apache.org/jira/browse/SPARK-7308
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.1
Reporter: Imran Rashid
Assignee: Imran Rashid