GitHub user KaiXinXiaoLei opened a pull request:

    https://github.com/apache/spark/pull/20865

    [SPARK-23542] The exists action should be further optimized in logical plan

    ## What changes were proposed in this pull request?
    
    The optimized logical plan of the query `select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)` is:
    
    > == Optimized Logical Plan ==
    Join LeftSemi, (i#14 = i#16)
    :- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
    +- Project [i#16]
       +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]
    
    The `exists` subquery is rewritten as a left semi join. But for the equivalent query `select * from tt1 left semi join tt2 on tt2.i = tt1.i`, the optimized logical plan is:
    
    > == Optimized Logical Plan ==
    Join LeftSemi, (i#22 = i#20)
    :- Filter isnotnull(i#20)
    :  +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
    +- Project [i#22]
       +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]
    
    So I think the optimized logical plan of `select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)` should be further optimized: it lacks the `isnotnull` filter that the explicit left semi join gets.
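
The reasoning behind the extra `isnotnull` filters can be illustrated with a toy model in plain Python. This is only a sketch of SQL NULL semantics, not Spark code: `None` stands for NULL, the table contents are made up, and the helper names (`sql_eq`, `left_semi_join`, `filter_not_null`) are hypothetical.

```python
# Toy model of SQL semantics, not Spark code. None stands for NULL.

def sql_eq(a, b):
    """SQL equality: comparing NULL with anything yields unknown (None)."""
    if a is None or b is None:
        return None
    return a == b

def left_semi_join(left, right, key):
    """Keep each left row that has at least one right row whose key compares TRUE."""
    return [l for l in left
            if any(sql_eq(l[key], r[key]) is True for r in right)]

def filter_not_null(rows, key):
    """Model of Filter isnotnull(key)."""
    return [row for row in rows if row[key] is not None]

# Made-up sample data for tt1 and tt2.
tt1 = [{"i": 1, "s": "a"}, {"i": None, "s": "b"}, {"i": 3, "s": "c"}]
tt2 = [{"i": 1, "s": "x"}, {"i": None, "s": "y"}]

plain = left_semi_join(tt1, tt2, "i")
filtered = left_semi_join(filter_not_null(tt1, "i"),
                          filter_not_null(tt2, "i"), "i")

print(plain == filtered)
```

Because a NULL key can never satisfy the equality condition, dropping NULL keys on either side before the join should not change the result of a left semi join, which is why the filter is a safe optimization for both forms of the query.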
    ## How was this patch tested?
    
    With this patch, the optimized logical plan of `select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)` is:
    
    > == Optimized Logical Plan ==
    Join LeftSemi, (i#14 = i#16)
    :- Filter isnotnull(i#14)
    :  +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
    +- Project [i#16]
       +- Filter isnotnull(i#16)
          +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/KaiXinXiaoLei/spark SPARK-23542

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20865.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20865
    
----
commit 3bf987828acea096811ba8dd1d42de8221cac62d
Author: KaiXinXiaoLei <584620569@...>
Date:   2018-03-02T03:33:26Z

    message

----