[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...

2018-01-29 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/20393 @sameeragarwal Interesting, this is still assuming that shuffle (after fetch) is stable, right? Is this guaranteed in the face of memory pressure/spills?

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...

2018-01-29 Thread sameeragarwal
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/20393 @mridulm one approach that Xingbo is looking into (independently of https://github.com/apache/spark/pull/20414) is to have the `ShuffleBlockFetcherIterator` remember the order of blocks it

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...

2018-01-27 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/20393 @sameeragarwal I am not sure we can make shuffle fetch deterministic without quite a lot of perf overhead; do you have any thoughts on how to do this, in case I am missing something here?

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...

2018-01-26 Thread shivaram
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/20393 I'm fine with merging this -- I just don't want this issue to be forgotten for RDDs, as I think it's a major correctness issue. @mridulm @sameeragarwal Let's continue the discussion on

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...

2018-01-26 Thread jiangxb1987
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20393 I opened https://issues.apache.org/jira/browse/SPARK-23243 to track the RDD.repartition() patch, thanks for all the discussions! @shivaram @mridulm @sameeragarwal @gatorsmile

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...

2018-01-26 Thread sameeragarwal
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/20393 LGTM but we should get a broader consensus on this. In the meantime, I'm merging this patch to master/2.3.

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...

2018-01-26 Thread jiangxb1987
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20393 Updated the title. Does it sound good to have this PR? I'll open another one to address the RDD.repartition() issue (which will target 2.4).

[GitHub] spark issue #20393: [SPARK-23207][SQL] Shuffle+Repartition on a DataFrame co...

2018-01-26 Thread sameeragarwal
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/20393 Another (possibly cleaner) approach here would be to make the shuffle block fetch order deterministic, but I agree that it might not be safe to include it in 2.3 this late.
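
Editor's illustration (not part of the thread): the concern about fetch order can be sketched in a few lines. This is a toy model in plain Python, not Spark code; `round_robin` and the sample records are hypothetical names introduced here. It assumes that rows are dealt to output partitions purely by arrival position, so a retried task that receives the same rows in a different order assigns them differently. Sorting before dealing is shown only as one possible way to remove the dependence on arrival order; the thread snippets above do not say which approach the patch takes.

```python
from typing import List, Sequence


def round_robin(records: Sequence[str], num_partitions: int) -> List[List[str]]:
    """Deal records to partitions by arrival position (toy model, not Spark)."""
    partitions: List[List[str]] = [[] for _ in range(num_partitions)]
    for position, record in enumerate(records):
        partitions[position % num_partitions].append(record)
    return partitions


# Same rows, two arrival orders: e.g. a first attempt and a retried task
# whose shuffle blocks were fetched in a different order (hypothetical data).
first_attempt = ["a", "b", "c", "d"]
retry_attempt = ["b", "a", "d", "c"]

# Without a fixed order, the two attempts place rows in different partitions.
unstable_first = round_robin(first_attempt, 2)
unstable_retry = round_robin(retry_attempt, 2)

# Imposing a deterministic order first makes both attempts agree.
stable_first = round_robin(sorted(first_attempt), 2)
stable_retry = round_robin(sorted(retry_attempt), 2)

print(unstable_first == unstable_retry)  # differing assignments
print(stable_first == stable_retry)      # identical assignments
```

If only some of the downstream tasks are rerun after such a retry, rows can end up duplicated or dropped, which is the correctness issue the comments refer to.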