GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/7988

    [SQL] [WIP] Demonstration of issue in Exchange planning

    This pull request serves to demonstrate a bug / issue in Exchange:
    
    Consider SortMergeJoin, which requires a sorted, clustered distribution of 
its input rows. Say that both of SMJ's children produce unsorted output but are 
both single partition. In this case, we will need to inject sort operators but 
should not need to inject exchanges. Unfortunately, it looks like the Exchange 
unnecessarily repartitions using a hash partitioning.
    
    I thought that this might be caused by the fact that `Repartition` does not 
declare a proper `outputPartitioning`, but fixing that did not prevent the 
unnecessary shuffle from taking place.
    
    #7959 would end up addressing this issue by dropping the `Repartition`, but 
I'm actually slightly concerned that `Exchange` even gets planned in the first 
place in this scenario.
    
    I think that the problem here may be the fact that `Exchange` reacts to 
either an unsatisfied ordering requirement _or_ a unsatisfied distribution 
requirement by choosing a target partitioning then checking whether it's 
"guaranteed" by the existing partitioning.  I think that it may make more sense 
to only look at picking a new partitioning when distributions are unsatisfied, 
not when only the ordering is unsatisfied.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark exchange-fixes

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7988.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #7988
    
----
commit 2dfc64811abcfb9a36ae5a2cf5060a5a28ef726b
Author: Josh Rosen <joshro...@databricks.com>
Date:   2015-08-06T07:05:35Z

    Add failing test illustrating bad exchange planning.

commit cc5669cd1cfd81df30d00901c65c79bc50c8d448
Author: Josh Rosen <joshro...@databricks.com>
Date:   2015-08-06T07:05:55Z

    Adding outputPartitioning to Repartition does not fix the test.

commit 067595637b3fea6eba43b43a87e274bb60e846e2
Author: Josh Rosen <joshro...@databricks.com>
Date:   2015-08-06T07:06:23Z

    Preserving ordering and partitioning in row format converters also does not 
help.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to