Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132799456
@shivaram did you create a JIRA for making this affect only ShuffledRDD? I
might do it as part of https://issues.apache.org/jira/browse/SPARK-9852, which
I'm working on
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132804519
Not yet - I was hoping to keep SPARK-10087 open, but I guess that's closed
now. Doing it as a part of SPARK-9852 sounds good to me. Let me know if you
want me to review
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132322822
cc @mateiz who has also been looking at this code recently
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132320966
Could you provide some more information about the map output? The reducer
locality should not kick in unless a certain map output location has more than
20% of the
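A minimal sketch of the threshold rule described in the comment above. It assumes the truncated sentence refers to a host's share of the bytes destined for one reducer. `MapOutput`, `ReducerLocalitySketch`, and `preferredHosts` are hypothetical names used only for illustration; this is not Spark's actual implementation.

```scala
// Hypothetical stand-in type; not one of Spark's real classes.
final case class MapOutput(host: String, bytesForReducer: Long)

object ReducerLocalitySketch {
  // Threshold taken from the comment: a host must hold more than 20%.
  val FractionThreshold = 0.2

  /** Hosts holding more than `threshold` of the bytes destined for one reducer. */
  def preferredHosts(
      outputs: Seq[MapOutput],
      threshold: Double = FractionThreshold): Seq[String] = {
    val total = outputs.map(_.bytesForReducer).sum
    if (total == 0L) Seq.empty
    else
      outputs
        .groupBy(_.host)
        // Sum the bytes each host holds for this reducer.
        .map { case (host, outs) => host -> outs.map(_.bytesForReducer).sum }
        // Keep only hosts whose share strictly exceeds the threshold.
        .collect { case (host, bytes) if bytes.toDouble / total > threshold => host }
        .toSeq
        .sorted
  }
}
```

Under this rule, a map stage whose output is spread evenly over many hosts yields no preference at all, while a stage with a single map task always yields exactly one preferred host.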
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132325480
Thanks for the info -- And just to confirm, is everything getting assigned
to Executor ID 23 (10.0.145.27) in the reduce stage?
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132320481
cc @shivaram
Github user yhuai commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132324006
The reduce stage has a 2-way join in it. The two map stages had 30 and 1
tasks, respectively. For the stage having 30 tasks, here is the screenshot of
task info
Github user yhuai commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132328891
Ah, sorry, I missed the reducer stage's screenshot. Yes, executor 23 was the
one that got all the reduce tasks.
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132331662
So my hypothesis right now is that the RDD in the reduce stage has two
Shuffle dependencies and the first shuffle dependency happens to be the single
map task stage --
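The hypothesis above can be illustrated with a small self-contained model. It assumes that locality preferences are derived from only the first shuffle dependency, and it includes a guard that counts shuffle dependencies, along the lines discussed in this thread. `Dep`, `ShuffleDep`, `NarrowDep`, and both functions are hypothetical stand-ins, and the host names are invented; none of this is Spark's scheduler code.

```scala
// Hypothetical model of an RDD's dependencies; not Spark's real classes.
sealed trait Dep
final case class ShuffleDep(mapHosts: Seq[String]) extends Dep
case object NarrowDep extends Dep

object FirstShuffleDepSketch {
  /** Preferences if only the first shuffle dependency is consulted. */
  def prefsFromFirstShuffle(deps: Seq[Dep]): Seq[String] =
    deps.collectFirst { case s: ShuffleDep => s.mapHosts.distinct }
      .getOrElse(Seq.empty)

  /** Guarded variant: express a preference only when there is exactly one shuffle dependency. */
  def prefsWithGuard(deps: Seq[Dep]): Seq[String] = {
    val numShuffleDeps = deps.count(_.isInstanceOf[ShuffleDep])
    if (numShuffleDeps == 1) prefsFromFirstShuffle(deps) else Seq.empty
  }
}

object FirstShuffleDepDemo {
  // A one-task map stage listed first, followed by a 30-task map stage.
  val small: ShuffleDep = ShuffleDep(Seq("host-a"))
  val big: ShuffleDep = ShuffleDep(Seq.tabulate(30)(i => s"host-$i"))
  val joinDeps: Seq[Dep] = Seq(small, big)
}
```

In this model, `prefsFromFirstShuffle(FirstShuffleDepDemo.joinDeps)` names only the host of the single map task, so every reduce task would prefer that one host. The guarded variant returns no preference for the join and leaves placement to the scheduler's default behavior.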
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132346850
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132346765
[Test build #41149 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41149/console)
for PR 8280 at commit
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132345128
The diff I'm proposing is something like
```
+val numShuffleDeps =
+  rdd.dependencies.filter(_.isInstanceOf[ShuffleDependency[_, _, _]]).length
+
```
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132415909
Ok - let's leave it on in master and I'll work with @mateiz on changes to
move this to ShuffledRDD and capture more use cases. @yhuai could you put in the
query you ran
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132415200
Let's close this one.
@shivaram can you submit a proper fix for master?
Github user yhuai closed the pull request at:
https://github.com/apache/spark/pull/8280
Github user yhuai commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132417459
@shivaram Sure. Just updated the JIRA description.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132346849
Merged build finished. Test FAILed.
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132401226
Why don't we turn it on in master but off in 1.5? At this point in the 1.5
cycle, I'm worried about potential bugs this would cause after more fixes.
Github user yhuai commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132414515
I created https://github.com/apache/spark/pull/8296 to change the default
setting to false for branch 1.5.
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132394991
It does sound good to turn it off if there are multiple dependencies.
However, an even better solution may be to move this into ShuffledRDD, so that
we control where
Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132408576
Sorry just too risky right now for 1.5.
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132395677
BTW it may also be fine to turn it off by default for 1.5, but in general,
with these things, there's not much point having them in the code if they're
off by default.
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/8280#issuecomment-132403639
But to Matei's point we don't get feedback if it's on in the master branch
as I guess many more people use a release. I think turning it off for the
multiple dependency