GitHub user gczsjdy opened a pull request: https://github.com/apache/spark/pull/19862
Make SortMergeJoin read less data when wholeStageCodegen is off ## What changes were proposed in this pull request? In SortMergeJoin(with wholeStageCodegen), an optimization already exists: if the left table of a partition is empty then there is no need to read the right table of this corresponding partition. This benefits the case in which many partitions of left table is empty and the right table is big. While in the code path without wholeStageCodegen, this optimization doesn't happen. This is mainly due to the lack of optimization in codegen-SortMergeJoin & the lack of support in `SortExec`, which doesn't do lazy evaluation. UI when wholeStageCodegen is off: <img width="908" alt="off_wholestage_before" src="https://user-images.githubusercontent.com/7685352/33493586-8311ac58-d6fb-11e7-816c-7a0fb2065345.PNG"> When the switch is on: ![image](https://user-images.githubusercontent.com/7685352/33493675-c821b81a-d6fb-11e7-8cf8-2e5baff75be3.png) This PR lazy evaluates sort, and only trigger the right table read after reading the left table and find it's not empty. After this PR, with wholeStageCodegen off: ![image](https://user-images.githubusercontent.com/7685352/33493784-2e1ee89a-d6fc-11e7-8201-71273de0b857.png) ## How was this patch tested? Existed test suites. You can merge this pull request into a Git repository by running: $ git pull https://github.com/gczsjdy/spark smj_less_read Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19862.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19862 ---- commit a8511e8b7e0240c471f88e703545301425960901 Author: GuoChenzhao <chenzhao....@intel.com> Date: 2017-11-28T09:01:34Z Init solution commit a5c14b473636e571c6d4f17798f220be080a27a1 Author: GuoChenzhao <chenzhao....@intel.com> Date: 2017-11-28T09:31:07Z Style commit a26ca57d56b0b2df81daa82da49f4ff564fc10f5 Author: GuoChenzhao <chenzhao....@intel.com> Date: 2017-11-28T09:37:54Z Comments commit 6d875f84de95d67f84ef9774cb6b6ee8273d46a6 Author: GuoChenzhao <chenzhao....@intel.com> Date: 2017-12-01T11:32:06Z lazy evaluation in SortExec ---- --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org