GitHub user KaiXinXiaoLei opened a pull request: https://github.com/apache/spark/pull/20865
[SPARK-23542] The exists action should be further optimized in logical plan

## What changes were proposed in this pull request?

The optimized logical plan of the query `select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)` is:

> == Optimized Logical Plan ==
> Join LeftSemi, (i#14 = i#16)
> :- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
> +- Project [i#16]
>    +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]

The `exists` action is rewritten as a left semi join, and no `isnotnull` filter is added on the join key. But for the query `select * from tt1 left semi join tt2 on tt2.i = tt1.i`, the optimized logical plan is:

> == Optimized Logical Plan ==
> Join LeftSemi, (i#22 = i#20)
> :- `Filter isnotnull`(i#20)
> :  +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21]
> +- Project [i#22]
>    +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]

So I think the optimized logical plan of `select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i);` should be further optimized in the same way.

## How was this patch tested?

With this patch, the optimized logical plan of `select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i);` is:

> == Optimized Logical Plan ==
> Join LeftSemi, (i#14 = i#16)
> :- Filter isnotnull(i#14)
> :  +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15]
> +- Project [i#16]
>    :- Filter isnotnull(i#16)
>    +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]
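Why adding `isnotnull` filters on the join key is safe can be illustrated with a small model in plain Python. This is only a sketch of SQL's NULL semantics for an equality semi join, not Spark's optimizer code; the helper names (`sql_equals`, `left_semi_join`, `filter_not_null`) and the sample rows are hypothetical. Under three-valued logic, `NULL = x` is never true, so a row with a NULL key can never find a match, and dropping such rows before the join does not change the result.

```python
from typing import List, Optional, Tuple

# A row is (i, s); i may be None to model SQL NULL. Hypothetical sample schema.
Row = Tuple[Optional[int], str]


def sql_equals(a: Optional[int], b: Optional[int]) -> Optional[bool]:
    """SQL three-valued equality: the result is unknown (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def left_semi_join(left: List[Row], right: List[Row]) -> List[Row]:
    """Keep each left row once if at least one right row satisfies left.i = right.i."""
    return [l for l in left if any(sql_equals(l[0], r[0]) is True for r in right)]


def filter_not_null(rows: List[Row]) -> List[Row]:
    """Model of `Filter isnotnull(i)` applied below the join."""
    return [r for r in rows if r[0] is not None]


# Hypothetical contents of tt1 and tt2, including NULL keys and a duplicate key.
tt1: List[Row] = [(1, "a"), (None, "b"), (2, "c"), (3, "d")]
tt2: List[Row] = [(1, "x"), (None, "y"), (3, "z"), (3, "w")]

# Semi join without any null filtering (shape of the unpatched `exists` plan).
plain = left_semi_join(tt1, tt2)

# Semi join with isnotnull filters pushed to both inputs (shape of the patched plan).
filtered = left_semi_join(filter_not_null(tt1), filter_not_null(tt2))

print(plain)
print(filtered)
```

Both calls return the same rows; the filtered variant simply lets each side discard NULL-keyed rows before the join is evaluated.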
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/KaiXinXiaoLei/spark SPARK-23542

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20865.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20865

----
commit 3bf987828acea096811ba8dd1d42de8221cac62d
Author: KaiXinXiaoLei <584620569@...>
Date:   2018-03-02T03:33:26Z

    message

----