[ 
https://issues.apache.org/jira/browse/SPARK-22270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22270:
---------------------------------
    Labels: bulk-closed  (was: )

> Renaming DF column breaks sparkPlan.outputOrdering
> --------------------------------------------------
>
>                 Key: SPARK-22270
>                 URL: https://issues.apache.org/jira/browse/SPARK-22270
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0, 2.2.0
>            Reporter: Yuri Bogomolov
>            Priority: Major
>              Labels: bulk-closed
>
> Renaming a column does not update the ordering/distribution metadata of the 
> plan. A later sort on the renamed column can therefore trigger an unnecessary 
> shuffle and sort, which may significantly affect performance.
> {code:java}
> val df = spark.sqlContext.range(0, 10)
> val sorted = df.sort("id")
> val renamed = sorted.withColumnRenamed("id", "id2")
> val sortedAgain = renamed.sort("id2")
> sortedAgain.explain(true)
> == Analyzed Logical Plan ==
> id2: bigint
> Sort [id2#6L ASC NULLS FIRST], true
> +- Project [id#0L AS id2#6L]
>    +- Sort [id#0L ASC NULLS FIRST], true
>       +- Range (0, 10, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Sort [id2#6L ASC NULLS FIRST], true
> +- Project [id#0L AS id2#6L]
>    +- Sort [id#0L ASC NULLS FIRST], true
>       +- Range (0, 10, step=1, splits=Some(4))
> == Physical Plan ==
> *Sort [id2#6L ASC NULLS FIRST], true, 0
> +- Exchange rangepartitioning(id2#6L ASC NULLS FIRST, 200)
>    +- *Project [id#0L AS id2#6L]
>       +- *Sort [id#0L ASC NULLS FIRST], true, 0
>          +- Exchange rangepartitioning(id#0L ASC NULLS FIRST, 200)
>             +- *Range (0, 10, step=1, splits=4)
> {code}
> The physical plan contains two Exchange and two Sort operators, so the 
> dataset is going to be shuffled and sorted twice.
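
The same check can be made programmatically instead of reading the explain output. The following is only a minimal sketch, not part of the original report: it assumes a local SparkSession, the object name RenameOrderingCheck is hypothetical, and it relies on the developer-facing queryExecution API, whose output differs between Spark versions. It prints the ordering that each physical plan reports and counts the SortExec operators in the final plan.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.SortExec

// Hypothetical helper object, used only for this illustration.
object RenameOrderingCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to build and inspect the plans.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("SPARK-22270-check")
      .getOrCreate()

    // Same steps as in the report.
    val sorted = spark.range(0, 10).sort("id")
    val renamed = sorted.withColumnRenamed("id", "id2")
    val sortedAgain = renamed.sort("id2")

    // Ordering reported by the physical plan before and after the rename.
    // If the second ordering still refers to the old attribute (id) rather
    // than the alias (id2), the planner cannot reuse it for sort("id2").
    println(s"sorted:  ${sorted.queryExecution.sparkPlan.outputOrdering}")
    println(s"renamed: ${renamed.queryExecution.sparkPlan.outputOrdering}")

    // Count the sort operators in the final physical plan. Note: on versions
    // with adaptive query execution enabled, the plan may be wrapped in a
    // node that this traversal does not descend into.
    val sorts = sortedAgain.queryExecution.executedPlan.collect {
      case s: SortExec => s
    }
    println(s"SortExec nodes in final plan: ${sorts.size}")

    spark.stop()
  }
}
```

On the affected versions listed above (2.1.0, 2.2.0), the plan in the report suggests that two SortExec nodes would be found; a single node would indicate that the existing ordering was reused.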


