[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-20 Thread GitBox
dongjoon-hyun commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-661348318 I'd like to ask for your help with the following, if you are interested. I believe we want to build a better Apache Spark together in the community. > If you generalize the

[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-20 Thread GitBox
dongjoon-hyun commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-661345335 For the following, I added test coverage in SPARK-32318 at master/3.0/2.4. Are you suggesting that's not enough? > Finally I do want to point out that there is no m

[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-20 Thread GitBox
dongjoon-hyun commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-661344352 @hvanhovell . The following is completely wrong because the above optimization was one of the recommendations for many Hortonworks customers to reduce their HDFS usage. I knew

[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-19 Thread GitBox
dongjoon-hyun commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-660693225 cc @cloud-fan and @gatorsmile once more. This is an automated message from the Apache Git Service. To resp

[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-19 Thread GitBox
dongjoon-hyun commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-660690119 Retest this please.

[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-15 Thread GitBox
dongjoon-hyun commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-659147381 Thank you for the quick update, @aokolnychyi . Also, thank you all for your opinions.

[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-15 Thread GitBox
dongjoon-hyun commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658831236 BTW, @aokolnychyi . I merged the corner-case test. Could you rebase this onto master? Then we can discuss how to proceed with this PR in a narrower direction.

[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-15 Thread GitBox
dongjoon-hyun commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658830542 cc @marmbrus and @gatorsmile since they know the existing customers well and are good at protecting them.

[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-15 Thread GitBox
dongjoon-hyun commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658828708 @hvanhovell . I agree with you on the following. > AFAIK nested ordering can be ignored from a relation algebra point of view. > Regarding the shuffles. ...

[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-14 Thread GitBox
dongjoon-hyun commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658559706 No. It depends on the file format rather than on the Spark side. For example, in the above example, the ORC files are small because ORC supports a special encoding when the data is sort

[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-14 Thread GitBox
dongjoon-hyun commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658558813 I made a PR to add test coverage for the above case. - https://github.com/apache/spark/pull/29118

[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-14 Thread GitBox
dongjoon-hyun commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658550248 Very sorry, guys. Due to the above regression, I'll revert this commit urgently. We can rethink this PR.

[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-14 Thread GitBox
dongjoon-hyun commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658549984 **AFTER SPARK-32276** ``` scala> scala.util.Random.shuffle((1 to 10).map(x => (x % 2, x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t") scala

[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-14 Thread GitBox
dongjoon-hyun commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658544475 To generate small Parquet/ORC files, we do the above tricks, don't we?

[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-14 Thread GitBox
dongjoon-hyun commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658543717 Oops. Sorry, guys. It seems that I missed something during testing. For the following case, we should not remove `Sort`. **BEFORE THIS PR** ```scala scala> Seq

[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-14 Thread GitBox
dongjoon-hyun commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-658538140 Also, cc @gatorsmile and @cloud-fan

[GitHub] [spark] dongjoon-hyun commented on pull request #29089: [SPARK-32276][SQL] Remove redundant sorts before repartition nodes

2020-07-13 Thread GitBox
dongjoon-hyun commented on pull request #29089: URL: https://github.com/apache/spark/pull/29089#issuecomment-657713339 Thank you for pinging me, @aokolnychyi .