[GitHub] spark pull request: [SPARK-2213][SQL] Sort Merge Join

Ishiihara Sat, 08 Nov 2014 21:32:20 -0800

GitHub user Ishiihara opened a pull request:

    https://github.com/apache/spark/pull/3173


    [SPARK-2213][SQL] Sort Merge Join 

    This PR adds MergeJoin operator to Spark SQL. The semantics of MergeJoin 
operator is similar to Hive's Sort merge bucket join.  
    
    MergeJoin operator relies on SortBasedShuffle to create partitions that 
sorted by the join key. In each partition, we merge the two child iterators. 
The tricky part in merge step is handling duplicate join keys. To handle 
duplicate keys, we use a buffer to store all matching elements in right 
iterator for a certain join key. The buffer is used for generating join tuples 
when the join key of the next left element is the same as the current join key. 
    
    MergeJoin reduces extra memory consumption, in the current implementation, 
MergeJoin only needs memory that can hold elements of the key that has the most 
duplicates in right iterator. 
    
    For query optimization,  we may resolve to MergeJoin when both relations 
are large and neither of the two can fit in memory. Currently, this heuristic 
is not added to optimizer. I would appreciate if you can add comments on how to 
resolve to MergeJoin in optimizer. 
     
    Currently, MergeJoin only supports inner join. However, it can be extended 
to support outer join. Will handle outer join in separate PRs. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Ishiihara/spark SparkSQL-merge

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3173.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3173
    
----
commit 1c41f6f248f1145c7d730129795e50bdd8a53f2b
Author: Liquan Pei <liquan...@gmail.com>
Date:   2014-10-28T23:47:35Z

    initial commit

commit dc6a6840e2d2b1681e70a6a3eeb10d7a9e6437ce
Author: Liquan Pei <liquan...@gmail.com>
Date:   2014-10-29T00:17:59Z

    add MergeJoin.scala

commit f5ef4624aea5304ffdcc8daf5fbebc20943c3cf4
Author: Liquan Pei <liquan...@gmail.com>
Date:   2014-11-09T04:05:56Z

    Merge join working

commit b13cc4526f0098386b64bce50b2f983f95709f23
Author: Liquan Pei <liquan...@gmail.com>
Date:   2014-11-09T04:09:11Z

    Merge remote-tracking branch 'upstream/master' into SparkSQL-merge
    
    Conflicts:
        sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala

commit d6b6e7b8194682c713400823a4fd17e0419d89e4
Author: Liquan Pei <liquan...@gmail.com>
Date:   2014-11-09T04:51:04Z

    add inline comments for merge join

commit 837eb081e6382a23b4fd67a5265188aab1c7e305
Author: Liquan Pei <liquan...@gmail.com>
Date:   2014-11-09T05:02:17Z

    use merge join as inner join operator in JoinSuite

commit 5cb98c306f76183e4148d9b0a6b0a8ce4d58368e
Author: Liquan Pei <liquan...@gmail.com>
Date:   2014-11-09T05:30:52Z

    improve inline comments

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-2213][SQL] Sort Merge Join

Reply via email to