GitHub user JoshRosen opened a pull request: https://github.com/apache/spark/pull/6772
[SPARK-8317] Do not push sort into shuffle in Exchange operator

In some cases, Spark SQL pushes sorting operations into the shuffle layer by specifying a key ordering as part of the shuffle dependency. I think that we should not do this:

- Since we do not delegate aggregation to Spark's shuffle, specifying the keyOrdering as part of the shuffle has no effect on the shuffle map side.
- By performing the sort ourselves (by inserting a sort operator after the shuffle instead), we can use the Exchange planner to choose specialized sorting implementations based on the types of rows being sorted.
- We can remove some complexity from SqlSerializer2 by not requiring it to know about sort orderings, since SQL's own sort operators will already perform the necessary defensive copying.

This patch removes Exchange's `canSortWithShuffle` path and the associated code in `SqlSerializer2`.

Shuffles that used to go through the `canSortWithShuffle` path would always wind up using Spark's `ExternalSorter` (inside of `HashShuffleReader`); to avoid a performance regression as a result of handling these shuffles ourselves, I've changed the SQLConf defaults so that external sorting is enabled by default.
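As a rough illustration of the idea in this description, the following toy model (plain Python, not Spark code) separates a shuffle that only hash-partitions rows from a sort step applied to each partition afterwards. The names `shuffle`, `sort_partitions`, and `exchange_then_sort` are hypothetical and do not correspond to anything in the patch; this is a sketch of the "sort operator after the shuffle" arrangement under those assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Row = Tuple[int, str]


def shuffle(rows: Iterable[Row], num_partitions: int,
            key: Callable[[Row], int]) -> Dict[int, List[Row]]:
    """Hash-partition rows by key. No ordering is imposed here (toy model)."""
    partitions: Dict[int, List[Row]] = defaultdict(list)
    for row in rows:
        partitions[hash(key(row)) % num_partitions].append(row)
    # Return every partition id, including empty ones.
    return {p: partitions.get(p, []) for p in range(num_partitions)}


def sort_partitions(partitions: Dict[int, List[Row]],
                    key: Callable[[Row], int]) -> Dict[int, List[Row]]:
    """Separate sort step applied to each partition after the shuffle."""
    return {p: sorted(rows, key=key) for p, rows in partitions.items()}


def exchange_then_sort(rows: Iterable[Row], num_partitions: int,
                       key: Callable[[Row], int]) -> Dict[int, List[Row]]:
    """Shuffle first, then sort: the sort is a distinct, swappable step."""
    return sort_partitions(shuffle(rows, num_partitions, key), key)


if __name__ == "__main__":
    rows = [(5, "e"), (2, "b"), (4, "d"), (1, "a"), (3, "c")]
    for pid, part in exchange_then_sort(rows, 2, key=lambda r: r[0]).items():
        print(pid, part)
```

Because the sort is its own step in this model, a different sorting routine could be substituted without touching the partitioning code, which is the kind of flexibility the second bullet above argues for.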
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-8317

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6772.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #6772

----
commit bf3b4c875555079a43807927fbceb75c5f03d08b
Author: Josh Rosen <joshro...@databricks.com>
Date:   2015-06-12T00:07:08Z

    Enable external sort by default

commit ebf9c0f5c57ebda64aeb676e28424482150e0fab
Author: Josh Rosen <joshro...@databricks.com>
Date:   2015-06-12T00:07:47Z

    Do not push sort into shuffle in Exchange operator

----