[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-09-11 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20414 Hi, @jiangxb1987 . Could you close this PR? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-07-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20414 Merged build finished. Test FAILed.

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-07-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20414 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93558/ Test FAILed.

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-07-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20414 **[Test build #93558 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93558/testReport)** for PR 20414 at commit

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-07-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20414 **[Test build #93558 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93558/testReport)** for PR 20414 at commit

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-29 Thread sameeragarwal
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/20414 Thanks @mridulm, all great points!

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-29 Thread jiangxb1987
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20414 Ouch... Yea, we have to work out a way to make it deterministic under hash collisions.

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-29 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/20414 @jiangxb1987 You are correct when the sizes of the maps are the same. But if the map sizes are different, the resulting order can be different - which can happen when requests for additional memory
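
To illustrate the point about map sizes, here is a minimal sketch of a toy open-addressing table with linear probing. It is not Spark's map implementation; `iteration_order` and its parameters are hypothetical names chosen for this example. It only shows how the same keys can come back in a different order when the table capacity differs.

```python
def iteration_order(keys, capacity):
    """Insert integer keys into a toy open-addressing table and return
    them in slot order. Hypothetical model, not Spark's actual map."""
    slots = [None] * capacity
    for key in keys:
        i = key % capacity              # toy hash: the key modulo capacity
        while slots[i] is not None:     # linear probing on collision
            i = (i + 1) % capacity
        slots[i] = key
    return [k for k in slots if k is not None]


keys = [1, 9, 4]
small = iteration_order(keys, capacity=8)    # 1 and 9 collide in slot 1
large = iteration_order(keys, capacity=16)   # no collision at this size
print(small)  # [1, 9, 4]
print(large)  # [1, 4, 9]
```

The keys and insertion order are identical in both calls; only the capacity changes, and that alone is enough to change the iteration order.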

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-29 Thread jiangxb1987
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20414 Hey, I looked into `ExternalAppendOnlyMap` and here are the findings: `ExternalAppendOnlyMap` claims to keep its content sorted, but it actually uses a `HashComparator` that compares the
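
As a rough illustration of why ordering by hash alone can be order-sensitive, the sketch below sorts keys with a deliberately weak toy hash (string length), so collisions are easy to produce. `toy_hash`, `sort_by_hash`, and `sort_with_tiebreak` are hypothetical names for this example and do not reflect Spark's `HashComparator` code.

```python
def toy_hash(key):
    """Deliberately weak hash so that collisions are easy to trigger."""
    return len(key)


def sort_by_hash(keys):
    # Python's sort is stable: keys with equal hashes keep arrival order.
    return sorted(keys, key=toy_hash)


def sort_with_tiebreak(keys):
    # Break hash ties with the key itself to get one canonical order.
    return sorted(keys, key=lambda k: (toy_hash(k), k))


first_arrival = ["ab", "cd", "x"]
second_arrival = ["cd", "ab", "x"]   # same keys, different arrival order

print(sort_by_hash(first_arrival))         # ['x', 'ab', 'cd']
print(sort_by_hash(second_arrival))        # ['x', 'cd', 'ab']
print(sort_with_tiebreak(first_arrival))   # ['x', 'ab', 'cd']
print(sort_with_tiebreak(second_arrival))  # ['x', 'ab', 'cd']
```

The tie-break only works here because the keys themselves are comparable, which is an assumption of this sketch.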

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-29 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/20414 @jiangxb1987 Unfortunately I am unable to analyze this in detail, but I can hopefully give some pointers that help. One example I can think of is, for shuffle which uses

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-29 Thread jiangxb1987
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20414 @mridulm I also agree we should follow @sameeragarwal's suggestion to let shuffle fetch produce deterministic output, and only do this for a few operations (e.g. repartition/zipWithIndex, do

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-29 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/20414 @shivaram Thinking about it more, everything which does a zip (or variants/similar idioms like limit K, etc.) on a partition should be affected - with random + index in coalesce +

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-28 Thread jiangxb1987
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20414 @cloud-fan Yea, you provide a clearer statement here, and I totally agree!

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-28 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20414 > Not quite - coalesce will not combine partitions across executors (aka shuffle) so you could still end up having many many files. I'm not sure if I follow here. For `coalesce(1)` Spark

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-28 Thread jiangxb1987
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20414 @felixcheung You are right that I didn't make it clear there should still be many shuffle blocks, and if the read task is retried it should be slower than using `repartition(1)` directly.

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-28 Thread felixcheung
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20414 > Actually for the first case, you should use coalesce() instead of repartition() to get a similar effect, without the need for another shuffle! Not quite - coalesce will not combine partitions

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-28 Thread jiangxb1987
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20414 Talked to @yanboliang offline; he said that the major use cases of RDD/DataFrame.repartition() he has observed in ML workloads are: 1. When saving models, you may need `repartition()` to

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-28 Thread shivaram
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/20414 @jiangxb1987 @mridulm Could we have a special case of using the sort-based approach when the RDD type is comparable? I think that should cover a bunch of the common cases and the hash version
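
One way to read the sort-based idea for comparable record types is sketched below. It assumes that repartitioning hands out records round-robin and that the "sort-based approach" means sorting the records first; both are assumptions of this example, and `round_robin` and `sorted_round_robin` are hypothetical helpers rather than Spark functions.

```python
def round_robin(records, num_partitions):
    """Assign the i-th record to partition i % num_partitions."""
    parts = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        parts[i % num_partitions].append(record)
    return parts


def sorted_round_robin(records, num_partitions):
    """Sort first (requires comparable records), then distribute."""
    return round_robin(sorted(records), num_partitions)


# The same records arriving in two different orders, e.g. after a retry.
attempt_1 = [3, 1, 2]
attempt_2 = [1, 2, 3]

print(round_robin(attempt_1, 2))         # [[3, 2], [1]]
print(round_robin(attempt_2, 2))         # [[1, 3], [2]]
print(sorted_round_robin(attempt_1, 2))  # [[1, 3], [2]]
print(sorted_round_robin(attempt_2, 2))  # [[1, 3], [2]]
```

Without the sort, the partition a record lands in depends on its arrival position; with the sort, both attempts produce the same assignment, at the cost of sorting and of requiring an ordering on the records.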

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-27 Thread felixcheung
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/20414 Just for context, I'm seeing RDD.repartition being used *a lot*.

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20414 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86728/ Test PASSed.

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20414 Merged build finished. Test PASSed.

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-27 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20414 **[Test build #86728 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86728/testReport)** for PR 20414 at commit

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-27 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/20414 In addition, any use of random in Spark code will be affected by this - unless the input is an idempotent source; even if random initialization is done predictably with the partition index (which we
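
The point about seeding with the partition index can be sketched in plain Python. The example assumes a generator seeded only by a partition index and a partition whose records may arrive in a different order on recomputation; `tag_with_random` is a hypothetical helper, not a Spark API.

```python
import random


def tag_with_random(records, partition_index):
    """Pair each record with a draw from an RNG seeded by the partition index."""
    rng = random.Random(partition_index)
    return [(record, rng.random()) for record in records]


# Same partition index, same records, but a different arrival order.
first = tag_with_random(["x", "y"], partition_index=0)
second = tag_with_random(["y", "x"], partition_index=0)

# The sequence of draws is identical because the seed is identical...
print([draw for _, draw in first] == [draw for _, draw in second])  # True
# ...but each record is paired with a different draw.
print(dict(first) == dict(second))  # False
```

The seed makes the random sequence reproducible, but not the pairing of values to records, so anything derived from that pairing (sampling, random assignment) can change when the input order changes.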

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-27 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20414 **[Test build #86728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86728/testReport)** for PR 20414 at commit

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20414 Merged build finished. Test PASSed.

[GitHub] spark issue #20414: [SPARK-23243][SQL] Shuffle+Repartition on an RDD could l...

2018-01-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20414 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/304/