[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-26 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/20393 @jiangxb1987 Btw, we could argue this is a correctness issue since we added repartition - so not necessarily blocker :-) --- -

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-26 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/20393 @jiangxb1987 Other than hash partitioning, I dont see how this can be handled reliably ... You are right, this is a basic correctness issue - unfortunately I never used this family of methods

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-26 Thread jiangxb1987
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20393 Another simple way to ensure correctness of RDD.repartition() is to do HashPartitioning instead of current RoundRobinPartitioning, but that will lead to regression when you have skew input

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-26 Thread jiangxb1987
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20393 Actually the similar approach cannot apply to fix RDD.repartition(), as in RDD[T], the data type `T` can be non-comparable, so we are not able to perform a local sort before actually

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20393 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86682/ Test PASSed. ---

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20393 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20393 **[Test build #86682 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86682/testReport)** for PR 20393 at commit

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20393 **[Test build #86682 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86682/testReport)** for PR 20393 at commit

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20393 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/266/

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20393 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread jiangxb1987
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20393 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20393 **[Test build #86668 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86668/testReport)** for PR 20393 at commit

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20393 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20393 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86668/ Test FAILed. ---

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread jiangxb1987
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20393 I added TODO on this, so we may have this for now and I'll continue working on the RDD path. --- - To unsubscribe, e-mail:

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20393 **[Test build #86668 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86668/testReport)** for PR 20393 at commit

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20393 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20393 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/255/

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread shivaram
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/20393 @sameeragarwal I think we should wait for the RDD fix for 2.3 as well ? --- - To unsubscribe, e-mail:

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread sameeragarwal
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/20393 LGTM, thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread sameeragarwal
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/20393 Yes, this bug also applies to RDD repartition but the current fix doesn't cover this (the local sort approach would be quite similar but it'll be a completely different codepath).

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread shivaram
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/20393 @jiangxb1987 If I'm not wrong this problem will also happen with RDD repartition ? Will this fix also cover that ? --- - To

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20393 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86639/ Test PASSed. ---

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20393 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20393 **[Test build #86639 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86639/testReport)** for PR 20393 at commit

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20393 **[Test build #86639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86639/testReport)** for PR 20393 at commit

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20393 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/230/

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20393 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread viirya
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20393 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail:

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20393 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20393 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86635/ Test FAILed. ---

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20393 **[Test build #86635 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86635/testReport)** for PR 20393 at commit

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20393 **[Test build #86635 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86635/testReport)** for PR 20393 at commit

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20393 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/226/

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on an RDD/DataFra...

2018-01-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20393 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional