Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/20414
Hi, @jiangxb1987 . Could you close this PR?
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20414
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20414
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93558/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20414
**[Test build #93558 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93558/testReport)** for PR 20414 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20414
**[Test build #93558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93558/testReport)** for PR 20414 at commit
Github user sameeragarwal commented on the issue:
https://github.com/apache/spark/pull/20414
Thanks @mridulm, all great points!
---
Github user jiangxb1987 commented on the issue:
https://github.com/apache/spark/pull/20414
Ouch... Yeah, we have to work out a way to make it deterministic under hash
collisions.
---
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/20414
@jiangxb1987 You are correct when the sizes of the maps are the same.
But if the map sizes differ, the resulting order can be different -
which can happen when requests for additional memory
Github user jiangxb1987 commented on the issue:
https://github.com/apache/spark/pull/20414
Hey, I searched `ExternalAppendOnlyMap` and here are the findings:
`ExternalAppendOnlyMap` claims it keeps its content sorted, but it
actually uses a `HashComparator` that compares the
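As a plain-Python illustration of the problem described above (this is not Spark's Scala code; `Key` and `hash_sorted` are made-up names): a comparator that orders records only by their hash code defines no order among colliding keys, so the "sorted" output depends on the order in which the keys arrived.

```python
# Toy model of ordering records only by hash code, HashComparator-style.
# All Key instances collide, so colliding keys keep their arrival order.
class Key:
    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return 42  # force every Key into the same hash bucket

    def __eq__(self, other):
        return isinstance(other, Key) and self.name == other.name

def hash_sorted(keys):
    # Python's sort is stable, so keys with equal hashes stay in input order.
    return sorted(keys, key=hash)

a, b = Key("a"), Key("b")
run1 = hash_sorted([a, b])
run2 = hash_sorted([b, a])  # same set of keys, different arrival order
# run1 puts a before b, run2 puts b before a: the same data produces
# two different "sorted" outputs across runs.
```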
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/20414
@jiangxb1987 Unfortunately I am unable to analyze this in detail, but I can
hopefully give some pointers that help!
One example I can think of is, for shuffle which uses
Github user jiangxb1987 commented on the issue:
https://github.com/apache/spark/pull/20414
@mridulm I also agree we should follow @sameeragarwal's suggestion to let
shuffle fetch produce deterministic output, and only do this for a few
operations (e.g. repartition/zipWithIndex, do
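A hedged sketch (plain Python with a hypothetical helper name, not Spark's implementation) of the kind of fix being discussed: if each partition is locally sorted before the round-robin assignment that `repartition` performs, the element-to-partition mapping no longer depends on the order in which elements happen to arrive.

```python
# Hypothetical sketch: sort a partition's elements locally before the
# round-robin assignment, so a reordered input (e.g. after a task retry)
# still yields the same (element -> target partition) pairs.
def deterministic_round_robin(partition, num_partitions):
    ordered = sorted(partition)  # local sort fixes the traversal order
    return [(elem, i % num_partitions) for i, elem in enumerate(ordered)]

assignment1 = deterministic_round_robin([3, 1, 2], 2)
assignment2 = deterministic_round_robin([2, 3, 1], 2)  # same data, new order
# Both attempts assign every element to the same target partition.
```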
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/20414
@shivaram Thinking more, this might affect everything that does a zip (or
variants/similar idioms like limit K, etc.) on a partition -
with random + index in coalesce +
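A minimal illustration (plain Python, not Spark's implementation) of why zip-style operations are sensitive to this: the assigned index is purely positional, so a retried upstream task that re-emits the same elements in a different order silently changes every pair.

```python
# zipWithIndex-style pairing: the index is positional, so element order
# fully determines the result.
def zip_with_index(elems):
    return [(e, i) for i, e in enumerate(elems)]

first_attempt = zip_with_index(["x", "y", "z"])
retry_attempt = zip_with_index(["y", "x", "z"])  # same elements, new order
# "x" maps to index 0 on the first attempt but index 1 on the retry,
# so a recomputed partition changes the answer without any error.
```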
Github user jiangxb1987 commented on the issue:
https://github.com/apache/spark/pull/20414
@cloud-fan Yes, you state it more clearly here, and I totally agree!
---
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/20414
> Not quite - coalesce will not combine partitions across executors (aka
shuffle) so you could still end up having many many files.
I'm not sure if I follow here. For `coalesce(1)` Spark
Github user jiangxb1987 commented on the issue:
https://github.com/apache/spark/pull/20414
@felixcheung You are right that I didn't make it clear: there would still be
many shuffle blocks, and if the read task is retried it would be slower than
using `repartition(1)` directly.
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/20414
> Actually for the first case, you shall use coalesce() instead of
repartition() to get a similar effect, without need of another shuffle!
Not quite - coalesce will not combine partitions
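As a toy model of the distinction being debated (plain Python; this deliberately ignores executors, locality, and Spark's actual partition-coalescing logic): `coalesce` only groups existing partitions, keeping each element with its neighbors, while `repartition` reshuffles every element into new partitions.

```python
# Toy model: coalesce groups whole partitions into fewer buckets without
# moving individual elements between groups; repartition redistributes
# every element (here round-robin by position, as a stand-in for a shuffle).
def coalesce(partitions, n):
    buckets = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        buckets[i * n // len(partitions)].extend(part)  # whole partition moves
    return buckets

def repartition(partitions, n):
    flat = [x for part in partitions for x in part]
    buckets = [[] for _ in range(n)]
    for i, x in enumerate(flat):
        buckets[i % n].append(x)  # every element is individually reassigned
    return buckets
```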
Github user jiangxb1987 commented on the issue:
https://github.com/apache/spark/pull/20414
Talked to @yanboliang offline; he said the major use cases of
RDD/DataFrame.repartition() in the ML workloads he has observed are:
1. When saving models, you may need `repartition()` to
Github user shivaram commented on the issue:
https://github.com/apache/spark/pull/20414
@jiangxb1987 @mridulm Could we special-case the sort-based approach for when
the RDD type is comparable? I think that should cover a bunch of
the common cases, and the hash version
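A hedged Python sketch of that special-casing (a hypothetical function, not Spark code): order records by the records themselves when they are mutually comparable, which gives a total and deterministic order, and fall back to hash order otherwise.

```python
# Hypothetical sketch: prefer a total, deterministic order when the record
# type is comparable; otherwise fall back to hash order, which leaves
# colliding records in an unspecified relative order.
def deterministic_order(records):
    try:
        return sorted(records)  # total order: fully deterministic
    except TypeError:  # records are not mutually comparable
        return sorted(records, key=hash)  # best effort: hash order only
```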
Github user felixcheung commented on the issue:
https://github.com/apache/spark/pull/20414
Just for context, I'm seeing RDD.repartition being used *a lot*.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20414
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86728/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20414
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20414
**[Test build #86728 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86728/testReport)** for PR 20414 at commit
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/20414
In addition, any use of random in Spark code will be affected by this -
unless the input is an idempotent source; even if random initialization is done
predictably with the partition index (which we
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/20414
**[Test build #86728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86728/testReport)** for PR 20414 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20414
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/20414
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/304/