[ https://issues.apache.org/jira/browse/SPARK-23542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
KaiXinXIaoLei updated SPARK-23542: ---------------------------------- Description: The optimized logical plan of query 'select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)' is : {noformat} == Optimized Logical Plan == Join LeftSemi, (i#22 = i#20) :- Filter isnotnull(i#20) : +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] +- Project [i#22] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat} >== Optimized Logical Plan == >Join LeftSemi, (i#143 = i#145) >:- MetastoreRelation default, tt1 >+- MetastoreRelation default, tt2 But the query of `select * from tt1 left semi join tt2 on tt2.i = tt1.i` is : {code:java} == Optimized Logical Plan == Join LeftSemi, (i#22 = i#20) :- Filter isnotnull(i#20) : +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] +- Project [i#22] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23] {code} So i think the optimized logical plan of 'select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i);` should be further optimization. was: The optimized logical plan of query 'select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i);` is : >== Optimized Logical Plan == >Join LeftSemi, (i#143 = i#145) >:- MetastoreRelation default, tt1 >+- MetastoreRelation default, tt2 But the query of `select * from tt1 left semi join tt2 on tt2.i = tt1.i` is : >== Optimized Logical Plan == Join LeftSemi, (i#152 = i#150) :- Filter isnotnull(i#150) : +- MetastoreRelation default, tt1 +- Project [i#152|#152] +- MetastoreRelation default, tt2 So i think the optimized logical plan of 'select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i);` should be further optimization. > The `where exists' action in optimized logical plan should be optimized > ------------------------------------------------------------------------ > > Key: SPARK-23542 > URL: https://issues.apache.org/jira/browse/SPARK-23542 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.3.0 > Reporter: KaiXinXIaoLei > Priority: Major > > The optimized logical plan of query 'select * from tt1 where exists (select * > from tt2 where tt1.i = tt2.i)' is : > > {noformat} > == Optimized Logical Plan == > Join LeftSemi, (i#22 = i#20) > :- Filter isnotnull(i#20) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] > +- Project [i#22] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat} > > >== Optimized Logical Plan == > >Join LeftSemi, (i#143 = i#145) > >:- MetastoreRelation default, tt1 > >+- MetastoreRelation default, tt2 > But the query of `select * from tt1 left semi join tt2 on tt2.i = tt1.i` is : > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#22 = i#20) > :- Filter isnotnull(i#20) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] > +- Project [i#22] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23] > {code} > > So i think the optimized logical plan of 'select * from tt1 where exists > (select * from tt2 where tt1.i = tt2.i);` should be further optimization. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org