Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/12436
@jisookim0513 - created a new PR -
https://github.com/apache/spark/pull/17297
Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/12436
>> Also, separately from what approach is used, how do you deal with the
following: suppose map task 1 loses its output (e.g., the reducer where that
task is located dies). Now, suppose reduce
Github user jisookim0513 commented on the issue:
https://github.com/apache/spark/pull/12436
@sitalkedia have you had a chance to work on this issue and open a new PR?
Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/12436
@kayousterhout - Thanks for taking a look at the PR. Currently I don't
have time to work on it. I will close the PR and open a new PR with the
issues addressed.
Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/12436
@sitalkedia this has been inactive for a while and there were a few issues
pointed out above that haven't yet been resolved. Do you have time to work on
this? Otherwise, can you close the
Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/12436
Yeah @mridulm that also seems like an issue with this approach.
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/12436
I am curious how this is resilient to epoch changes, which will be
triggered by executor loss for a shuffle task when its shuffle map task's
executor is gone.
Won't it create issues if
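(For context, the epoch mechanism referred to above works roughly as in the
sketch below. This is a simplified illustration, not Spark's actual
MapOutputTracker code; the class and member names are made up for exposition.)

    // Simplified illustration of epoch-based invalidation; all names
    // here are hypothetical, not the real MapOutputTracker API.
    class EpochSketch {
      private var epoch: Long = 0L

      // Losing an executor bumps the epoch, marking map outputs
      // registered before the loss as potentially stale.
      def onExecutorLost(): Unit = synchronized { epoch += 1 }

      def currentEpoch: Long = synchronized { epoch }

      // A task launched under an older epoch may depend on output from
      // the lost executor, so its view of the map outputs is suspect.
      def isStale(taskLaunchEpoch: Long): Boolean = synchronized {
        taskLaunchEpoch < epoch
      }
    }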
Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/12436
@sitalkedia I was thinking about this over the weekend and I'm not sure
this is the right approach. I suspect it might be better to re-use the same
task set manager for the new stage. This
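(A rough sketch of the contrast being suggested, assuming the PR creates a
fresh task set manager per stage attempt; the names below are illustrative,
not Spark's actual scheduler API.)

    // Hypothetical contrast of the two approaches; all names invented.
    import scala.collection.mutable

    class TaskSetManagerSketch(val stageId: Int, val attempt: Int) {
      val runningTaskIds = mutable.Set[Int]()
    }

    object RetrySketch {
      private val managers = mutable.Map[Int, TaskSetManagerSketch]()

      // One manager per attempt: tasks still running under the old
      // attempt end up tracked separately from the retry's manager.
      def freshManager(stageId: Int, attempt: Int): TaskSetManagerSketch =
        new TaskSetManagerSketch(stageId, attempt)

      // Suggested alternative: reuse the stage's existing manager so it
      // keeps ownership of still-running tasks across the retry.
      def reusedManager(stageId: Int, attempt: Int): TaskSetManagerSketch =
        managers.getOrElseUpdate(stageId,
          new TaskSetManagerSketch(stageId, attempt))
    }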
Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/12436
@davies - Thanks for looking into this. Updated the PR description with
details of the change. Let me know if the approach seems reasonable; I will
work on rebasing the change against latest
Github user davies commented on the issue:
https://github.com/apache/spark/pull/12436
@sitalkedia Had a quick look at this one; the use case sounds good, and we
should improve the stability of long-running tasks. Could you explain a bit
more how the current patch works? (in the PR
Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/12436
ping.
Github user markhamstra commented on the issue:
https://github.com/apache/spark/pull/12436
See https://issues.apache.org/jira/browse/SPARK-17064
Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/12436
@rxin - The idea is not to rerun or kill already-running tasks in the case
of a fetch failure, because they might still finish. If those tasks end up
failing later, the DAGScheduler will rerun them.
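(A minimal sketch of the policy described above, against a much-simplified
scheduler; the types and method names are hypothetical, not the real
DAGScheduler API.)

    // Minimal sketch of "don't kill running tasks on a fetch failure".
    // Everything here is a simplified stand-in for the real scheduler.
    import scala.collection.mutable

    case class TaskSketch(id: Int, stageId: Int)

    class FetchFailurePolicySketch {
      private val running = mutable.Set[TaskSketch]()

      // A fetch failure only resubmits the task whose map output was
      // lost; already-running tasks are left alone, since they may
      // still finish successfully.
      def onFetchFailure(lostMapTask: TaskSketch): Unit =
        resubmit(lostMapTask)

      // A running task that later fails for its own reasons is rerun
      // at that point, as described above.
      def onTaskFailed(task: TaskSketch): Unit = {
        running -= task
        resubmit(task)
      }

      private def resubmit(task: TaskSketch): Unit = running += task
    }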
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/12436
Is the idea here to not rerun jobs that are already running in the case of
a fetch failure, because they might finish?
What happens after the change if those tasks end up coming back as
Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/12436
ping.
Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/12436
@kayousterhout - Our use case is a very large workload on Spark. We are
processing around 100 TB of data in a single Spark job with 100k tasks in it
(BTW, the single-threaded DAGScheduler is
Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/12436
@sitalkedia What's the use case for this? In the cases I've seen, if
there's one fetch failure, it typically means that a machine that ran a map
task has failed / gone down / been revoked by