GitHub user gczsjdy opened a pull request:

    https://github.com/apache/spark/pull/19862

    Make SortMergeJoin read less data when wholeStageCodegen is off

    ## What changes were proposed in this pull request?
    
    In SortMergeJoin (with wholeStageCodegen), an optimization already exists: 
if the left table of a partition is empty, there is no need to read the 
right table of the corresponding partition. This benefits the case in which 
many partitions of the left table are empty and the right table is big.
    
    In the code path without wholeStageCodegen, this optimization doesn't 
happen. This is mainly because the non-codegen SortMergeJoin lacks the 
optimization and `SortExec` doesn't evaluate lazily. UI when 
wholeStageCodegen is off:
    <img width="908" alt="off_wholestage_before" 
src="https://user-images.githubusercontent.com/7685352/33493586-8311ac58-d6fb-11e7-816c-7a0fb2065345.PNG">
    
    When the switch is on: 
    
![image](https://user-images.githubusercontent.com/7685352/33493675-c821b81a-d6fb-11e7-8cf8-2e5baff75be3.png)
    
    This PR evaluates the sort lazily, and only triggers the right table read 
after reading the left table and finding that it's not empty.
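
    As a rough illustration of the idea (not the code in this PR), the sketch 
below wraps a sort in an iterator that consumes its input only on first 
access, and a simplified merge join that checks the left side before touching 
the right. `LazySortSketch`, `LazySortedIterator` and `mergeJoin` are 
hypothetical names, and the join assumes sorted, duplicate-free integer keys.

```scala
// Hypothetical sketch: none of these names exist in Spark.
object LazySortSketch {

  // Defers reading and sorting `input` until hasNext/next is first called.
  final class LazySortedIterator[T: Ordering](input: () => Iterator[T])
      extends Iterator[T] {
    // `lazy val` ensures the upstream read happens at most once, on demand.
    private lazy val sorted: Iterator[T] = input().toVector.sorted.iterator
    override def hasNext: Boolean = sorted.hasNext
    override def next(): T = sorted.next()
  }

  // Simplified inner merge join over sorted, duplicate-free keys.
  def mergeJoin(left: Iterator[Int], right: Iterator[Int]): List[Int] = {
    if (!left.hasNext) {
      // Empty left side: `right` is never touched, so its lazy sort never runs.
      Nil
    } else {
      val out = List.newBuilder[Int]
      val l = left.buffered
      val r = right.buffered
      while (l.hasNext && r.hasNext) {
        if (l.head < r.head) l.next()
        else if (l.head > r.head) r.next()
        else { out += l.next(); r.next() }
      }
      out.result()
    }
  }
}
```

    With an empty left iterator, the function passed to `LazySortedIterator` 
is never invoked; with a non-empty left iterator it is invoked exactly once.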
    
    After this PR, with wholeStageCodegen off:
    
![image](https://user-images.githubusercontent.com/7685352/33493784-2e1ee89a-d6fc-11e7-8201-71273de0b857.png)
    
    ## How was this patch tested?
    
    Existing test suites.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gczsjdy/spark smj_less_read

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19862.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19862
    
----
commit a8511e8b7e0240c471f88e703545301425960901
Author: GuoChenzhao <chenzhao....@intel.com>
Date:   2017-11-28T09:01:34Z

    Init solution

commit a5c14b473636e571c6d4f17798f220be080a27a1
Author: GuoChenzhao <chenzhao....@intel.com>
Date:   2017-11-28T09:31:07Z

    Style

commit a26ca57d56b0b2df81daa82da49f4ff564fc10f5
Author: GuoChenzhao <chenzhao....@intel.com>
Date:   2017-11-28T09:37:54Z

    Comments

commit 6d875f84de95d67f84ef9774cb6b6ee8273d46a6
Author: GuoChenzhao <chenzhao....@intel.com>
Date:   2017-12-01T11:32:06Z

    lazy evaluation in SortExec

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
