GitHub user Ishiihara opened a pull request: https://github.com/apache/spark/pull/3173
[SPARK-2213][SQL] Sort Merge Join

This PR adds a MergeJoin operator to Spark SQL. The semantics of the MergeJoin operator are similar to Hive's sort merge bucket join. MergeJoin relies on SortBasedShuffle to create partitions that are sorted by the join key. Within each partition, we merge the two child iterators.

The tricky part of the merge step is handling duplicate join keys. To handle them, we use a buffer that stores all elements in the right iterator that match the current join key. The buffer is reused to generate join tuples when the next left element has the same join key as the current one.

MergeJoin reduces extra memory consumption: in the current implementation, it only needs enough memory to hold the elements of the key that has the most duplicates in the right iterator.

For query optimization, we may resolve to MergeJoin when both relations are large and neither of the two can fit in memory. This heuristic has not yet been added to the optimizer. I would appreciate comments on how to resolve to MergeJoin in the optimizer.

Currently, MergeJoin only supports inner joins. It can be extended to support outer joins, which I will handle in separate PRs.
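For reviewers who want the buffering idea in isolation, here is a minimal sketch of the merge step described above. It is written in Python for brevity and is not the code in MergeJoin.scala. The function name `merge_join` and the representation of rows as `(key, value)` pairs are assumptions made for illustration. It assumes both inputs are already sorted ascending by join key, as they would be within one partition after a sort-based shuffle.

```python
from typing import Any, Iterable, Iterator, List, Optional, Tuple

Row = Tuple[Any, Any]  # hypothetical row shape: (join_key, value)


def merge_join(left: Iterable[Row], right: Iterable[Row]) -> Iterator[Tuple[Any, Any, Any]]:
    """Inner-join two inputs that are each sorted ascending by join key."""
    right_iter = iter(right)
    right_row: Optional[Row] = next(right_iter, None)

    buffer: List[Any] = []   # right-side values matching buffer_key
    buffer_key: Any = None
    buffer_valid = False     # False until the buffer is filled for some key

    for left_key, left_val in left:
        if not buffer_valid or left_key != buffer_key:
            # New left key: discard the old buffer and advance the right side.
            buffer = []
            # Skip right rows whose key is smaller than the left key.
            while right_row is not None and right_row[0] < left_key:
                right_row = next(right_iter, None)
            # Buffer every right row that shares the left key.
            while right_row is not None and right_row[0] == left_key:
                buffer.append(right_row[1])
                right_row = next(right_iter, None)
            buffer_key = left_key
            buffer_valid = True
        # Same key as the previous left row reuses the buffer unchanged.
        for right_val in buffer:
            yield (left_key, left_val, right_val)


if __name__ == "__main__":
    left_rows = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
    right_rows = [(0, "w"), (2, "x"), (2, "y"), (3, "z"), (4, "v")]
    for joined in merge_join(left_rows, right_rows):
        print(joined)
```

In this sketch the buffer never holds more than the right-side rows for a single key, which is the memory bound the description refers to. Each input is consumed once, front to back.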
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Ishiihara/spark SparkSQL-merge

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3173.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3173

----
commit 1c41f6f248f1145c7d730129795e50bdd8a53f2b
Author: Liquan Pei <liquan...@gmail.com>
Date:   2014-10-28T23:47:35Z

    initial commit

commit dc6a6840e2d2b1681e70a6a3eeb10d7a9e6437ce
Author: Liquan Pei <liquan...@gmail.com>
Date:   2014-10-29T00:17:59Z

    add MergeJoin.scala

commit f5ef4624aea5304ffdcc8daf5fbebc20943c3cf4
Author: Liquan Pei <liquan...@gmail.com>
Date:   2014-11-09T04:05:56Z

    Merge join working

commit b13cc4526f0098386b64bce50b2f983f95709f23
Author: Liquan Pei <liquan...@gmail.com>
Date:   2014-11-09T04:09:11Z

    Merge remote-tracking branch 'upstream/master' into SparkSQL-merge

    Conflicts:
        sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala

commit d6b6e7b8194682c713400823a4fd17e0419d89e4
Author: Liquan Pei <liquan...@gmail.com>
Date:   2014-11-09T04:51:04Z

    add inline comments for merge join

commit 837eb081e6382a23b4fd67a5265188aab1c7e305
Author: Liquan Pei <liquan...@gmail.com>
Date:   2014-11-09T05:02:17Z

    use merge join as inner join operator in JoinSuite

commit 5cb98c306f76183e4148d9b0a6b0a8ce4d58368e
Author: Liquan Pei <liquan...@gmail.com>
Date:   2014-11-09T05:30:52Z

    improve inline comments

----

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.