[ https://issues.apache.org/jira/browse/SPARK-23542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
KaiXinXIaoLei updated SPARK-23542: ---------------------------------- Description: The optimized logical plan of query '*select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)*' is : {code:java} == Optimized Logical Plan == Join LeftSemi, (i#14 = i#16) :- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] +- Project [i#16] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} But the query of `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*` is : {noformat} == Optimized Logical Plan == Join LeftSemi, (i#22 = i#20) :- Filter isnotnull(i#20) : +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] +- Project [i#22] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat} So i think the optimized logical plan of '*select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)*;` should be further optimization. {code:java} == Optimized Logical Plan == Join LeftSemi, (i#14 = i#16) :- Filter isnotnull(i#20) : +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] +- Project [i#16] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} was: The optimized logical plan of query 'select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i)' is : {code:java} == Optimized Logical Plan == Join LeftSemi, (i#14 = i#16) :- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] +- Project [i#16] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} But the query of `select * from tt1 left semi join tt2 on tt2.i = tt1.i` is : {noformat} == Optimized Logical Plan == Join LeftSemi, (i#22 = i#20) :- Filter isnotnull(i#20) : +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] +- Project [i#22] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat} So i think the optimized logical plan of 'select * from tt1 where exists (select * from tt2 where tt1.i = tt2.i);` should be further optimization. {code:java} == Optimized Logical Plan == Join LeftSemi, (i#14 = i#16) :- Filter isnotnull(i#20) : +- HiveTableRelation `default`.`tt1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] +- Project [i#16] +- HiveTableRelation `default`.`tt2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > The `where exists' action in optimized logical plan should be optimized > ------------------------------------------------------------------------ > > Key: SPARK-23542 > URL: https://issues.apache.org/jira/browse/SPARK-23542 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.3.0 > Reporter: KaiXinXIaoLei > Priority: Major > > The optimized logical plan of query '*select * from tt1 where exists (select > * from tt2 where tt1.i = tt2.i)*' is : > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > > But the query of `*select * from tt1 left semi join tt2 on tt2.i = tt1.i*` is > : > {noformat} > == Optimized Logical Plan == > Join LeftSemi, (i#22 = i#20) > :- Filter isnotnull(i#20) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#20, s#21] > +- Project [i#22] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#22, s#23]{noformat} > > So i think the optimized logical plan of '*select * from tt1 where exists > (select * from tt2 where tt1.i = tt2.i)*;` should be further optimization. > {code:java} > == Optimized Logical Plan == > Join LeftSemi, (i#14 = i#16) > :- Filter isnotnull(i#20) > : +- HiveTableRelation `default`.`tt1`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#14, s#15] > +- Project [i#16] > +- HiveTableRelation `default`.`tt2`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#16, s#17]{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org