GitHub user JoshRosen opened a pull request: https://github.com/apache/spark/pull/7988
    [SQL] [WIP] Demonstration of issue in Exchange planning

    This pull request serves to demonstrate a bug / issue in Exchange:

    Consider SortMergeJoin, which requires a sorted, clustered distribution of its input rows. Say that both of SMJ's children produce unsorted output but are both single-partition. In this case, we will need to inject sort operators but should not need to inject exchanges. Unfortunately, it looks like the Exchange unnecessarily repartitions using a hash partitioning.

    I thought that this might be caused by the fact that `Repartition` does not declare a proper `outputPartitioning`, but fixing that did not prevent the unnecessary shuffle from taking place. #7959 would end up addressing this issue by dropping the `Repartition`, but I'm actually slightly concerned that `Exchange` even gets planned in the first place in this scenario.

    I think that the problem here may be the fact that `Exchange` reacts to either an unsatisfied ordering requirement _or_ an unsatisfied distribution requirement by choosing a target partitioning and then checking whether it's "guaranteed" by the existing partitioning. I think that it may make more sense to pick a new partitioning only when the distribution is unsatisfied, not when only the ordering is unsatisfied.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark exchange-fixes

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7988.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #7988

----
commit 2dfc64811abcfb9a36ae5a2cf5060a5a28ef726b
Author: Josh Rosen <joshro...@databricks.com>
Date:   2015-08-06T07:05:35Z

    Add failing test illustrating bad exchange planning.

commit cc5669cd1cfd81df30d00901c65c79bc50c8d448
Author: Josh Rosen <joshro...@databricks.com>
Date:   2015-08-06T07:05:55Z

    Adding outputPartitioning to Repartition does not fix the test.

commit 067595637b3fea6eba43b43a87e274bb60e846e2
Author: Josh Rosen <joshro...@databricks.com>
Date:   2015-08-06T07:06:23Z

    Preserving ordering and partitioning in row format converters also does not help.

----
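
For readers who want to see the planning issue from the description in miniature, here is a toy Python model. It is not Spark code: the class and function names (`Child`, `satisfies`, `guarantees`, `plan_eager`, `plan_proposed`) are hypothetical, the "guarantees" check is simplified to plain equality, and the default of 200 partitions is an arbitrary illustrative value. It only sketches the difference between reacting to either unmet requirement and reacting to an unmet distribution alone, as the description suggests.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SinglePartition:
    pass


@dataclass(frozen=True)
class HashPartitioning:
    keys: Tuple[str, ...]
    num_partitions: int


@dataclass(frozen=True)
class ClusteredDistribution:
    keys: Tuple[str, ...]


@dataclass(frozen=True)
class Child:
    partitioning: object          # SinglePartition or HashPartitioning
    ordering: Tuple[str, ...]     # sort keys the child already provides


def satisfies(partitioning, required):
    """Loose check: are rows with equal required keys co-located?"""
    if isinstance(partitioning, SinglePartition):
        return True  # everything is in one partition, so any clustering holds
    return set(partitioning.keys) <= set(required.keys)


def guarantees(partitioning, target):
    """Strict check, simplified here (assumption) to equality with the target."""
    return partitioning == target


def ordering_ok(child, ordering):
    return child.ordering[:len(ordering)] == ordering


def plan_eager(child, required, ordering, num_partitions=200):
    """Model of the behaviour described: either unmet requirement picks a target."""
    ops = []
    needs_sort = not ordering_ok(child, ordering)
    if not satisfies(child.partitioning, required) or needs_sort:
        target = HashPartitioning(required.keys, num_partitions)
        if not guarantees(child.partitioning, target):
            ops.append("Exchange")
            needs_sort = bool(ordering)  # a shuffle discards any prior ordering
    if needs_sort:
        ops.append("Sort")
    return ops


def plan_proposed(child, required, ordering, num_partitions=200):
    """Model of the suggestion: only an unmet distribution triggers an Exchange."""
    ops = []
    needs_sort = not ordering_ok(child, ordering)
    if not satisfies(child.partitioning, required):
        ops.append("Exchange")
        needs_sort = bool(ordering)
    if needs_sort:
        ops.append("Sort")
    return ops


# An unsorted, single-partition child feeding a join on key "k".
smj_child = Child(SinglePartition(), ordering=())
required = ClusteredDistribution(("k",))

print(plan_eager(smj_child, required, ("k",)))     # ['Exchange', 'Sort']
print(plan_proposed(smj_child, required, ("k",)))  # ['Sort']
```

In this model the single-partition child already satisfies the clustering requirement, so the proposed variant adds only the sort. A child that is hash-partitioned on a different key still gets an exchange under both variants.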