[ https://issues.apache.org/jira/browse/SPARK-31148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tanapol Nearunchorn updated SPARK-31148: ---------------------------------------- Summary: Switching inner join statement yield different result (was: Physical Plan and Optimized Logical Plan are mismatch) > Switching inner join statement yield different result > ----------------------------------------------------- > > Key: SPARK-31148 > URL: https://issues.apache.org/jira/browse/SPARK-31148 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.4.4 > Reporter: Tanapol Nearunchorn > Priority: Major > > Given this query > > {code:java} > select r.id > from sqoop_wongnai.wongnai__w_ref r > inner join sqoop_wongnai.wongnai__w_photo p on p.root_referrer_id = r.id > inner join tests.restaurants_id_export_photos i on i.restaurant_id = > r.restaurant_id > where r.restaurant_id = 360 > {code} > Spark return nothing except when I switch inner join statement between line 3 > and 4 there are results returned. > I guarantee that we have data that match the join condition above. > Here's result from explain extend of above query: > > {code:java} > == Parsed Logical Plan == > 'Project ['r.id] > +- 'Filter ('r.restaurant_id = 360) > +- 'Join Inner, ('i.restaurant_id = 'r.restaurant_id) > :- 'Join Inner, ('p.root_referrer_id = 'r.id) > : :- 'SubqueryAlias `r` > : : +- 'UnresolvedRelation `sqoop_wongnai`.`wongnai__w_ref` > : +- 'SubqueryAlias `p` > : +- 'UnresolvedRelation `sqoop_wongnai`.`wongnai__w_photo` > +- 'SubqueryAlias `i` > +- 'UnresolvedRelation `tests`.`restaurants_id_export_photos` > == Analyzed Logical Plan == > id: bigint > Project [id#1834834L] > +- Filter (restaurant_id#1834836L = cast(360 as bigint)) > +- Join Inner, (restaurant_id#1834876L = restaurant_id#1834836L) > :- Join Inner, (root_referrer_id#1834858L = id#1834834L) > : :- SubqueryAlias `r` > : : +- SubqueryAlias `sqoop_wongnai`.`wongnai__w_ref` > : : +- HiveTableRelation `sqoop_wongnai`.`wongnai__w_ref`, > org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#1834834L, > restaurant_id#1834836L,] > : +- SubqueryAlias `p` > : +- SubqueryAlias `sqoop_wongnai`.`wongnai__w_photo` > : +- HiveTableRelation `sqoop_wongnai`.`wongnai__w_photo`, > org.apache.hadoop.hive.serde2.avro.AvroSerDe, [root_referrer_id#1834858L, ... > 14 more fields] > +- SubqueryAlias `i` > +- SubqueryAlias `tests`.`restaurants_id_export_photos` > +- HiveTableRelation `tests`.`restaurants_id_export_photos`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, > [restaurant_id#1834876L]== Optimized Logical Plan == > Project [id#1834834L] > +- Join Inner, (restaurant_id#1834876L = restaurant_id#1834836L) > :- Project [id#1834834L, restaurant_id#1834836L] > : +- Join Inner, (root_referrer_id#1834858L = id#1834834L) > : :- Project [id#1834834L, restaurant_id#1834836L] > : : +- Filter ((isnotnull(restaurant_id#1834836L) && > (restaurant_id#1834836L = 360)) && isnotnull(id#1834834L)) > : : +- InMemoryRelation [id#1834834L, restaurant_id#1834836L], > StorageLevel(disk, memory, deserialized, 1 replicas) > : : +- Scan hive sqoop_wongnai.wongnai__w_ref [id#87357L, > restaurant_id#87359L], HiveTableRelation `sqoop_wongnai`.`wongnai__w_ref`, > org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#87357L, > restaurant_id#87359L] > : +- Project [root_referrer_id#1834858L] > : +- Filter (isnotnull(root_referrer_id#1834858L) && > bloomfilter#1835088 of [id#1834834L] filtering [root_referrer_id#1834858L]) > : +- InMemoryRelation [root_referrer_id#1834858L], > StorageLevel(disk, memory, deserialized, 1 replicas) > : +- Scan hive sqoop_wongnai.wongnai__w_photo > [root_referrer_id#82906L], HiveTableRelation > `sqoop_wongnai`.`wongnai__w_photo`, > org.apache.hadoop.hive.serde2.avro.AvroSerDe, [root_referrer_id#82906L] > +- Filter ((isnotnull(restaurant_id#1834876L) && (restaurant_id#1834876L = > 360)) && bloomfilter#1835087 of [restaurant_id#1834836L] filtering > [restaurant_id#1834876L]) > +- HiveTableRelation `tests`.`restaurants_id_export_photos`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [restaurant_id#1834876L] > == Physical Plan == > *(7) Project [id#1834834L] > +- *(7) SortMergeJoin [restaurant_id#1834836L], [restaurant_id#1834876L], > Inner > :- *(5) Sort [restaurant_id#1834836L ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(restaurant_id#1834836L, 200) > : +- *(4) Project [id#1834834L, restaurant_id#1834836L] > : +- *(4) SortMergeJoin [id#1834834L], [root_referrer_id#1834858L], > Inner > : :- *(2) Sort [id#1834834L ASC NULLS FIRST], false, 0 > : : +- Exchange hashpartitioning(id#1834834L, 200) > : : +- *(1) Filter ((isnotnull(restaurant_id#1834836L) && > (restaurant_id#1834836L = 360)) && isnotnull(id#1834834L)) > : : +- *(1) InMemoryTableScan [id#1834834L, > restaurant_id#1834836L], [isnotnull(restaurant_id#1834836L), > (restaurant_id#1834836L = 360), isnotnull(id#1834834L)] > : : +- InMemoryRelation [id#1834834L, > restaurant_id#1834836L], StorageLevel(disk, memory, deserialized, 1 replicas) > : : +- Scan hive sqoop_wongnai.wongnai__w_ref > [id#87357L, restaurant_id#87359L], HiveTableRelation > `sqoop_wongnai`.`wongnai__w_ref`, > org.apache.hadoop.hive.serde2.avro.AvroSerDe, [id#87357L, > restaurant_id#87359L] > : +- *(3) Sort [root_referrer_id#1834858L ASC NULLS FIRST], > false, 0 > : +- Exchange hashpartitioning(root_referrer_id#1834858L, 200) > : +- Filter (isnotnull(root_referrer_id#1834858L) && > bloomfilter#1835088 of [bf1835088 id#1834834L estimatedNumRows=1457534] > filtering [root_referrer_id#1834858L]) > : : +- GenerateBloomFilter > : : +- ReusedExchange [id#1834834L, > restaurant_id#1834836L], Exchange hashpartitioning(id#1834834L, 200) > : +- InMemoryTableScan [root_referrer_id#1834858L], > [isnotnull(root_referrer_id#1834858L), bloomfilter#1835088 of [bf1835088 > id#1834834L estimatedNumRows=1457534] filtering [root_referrer_id#1834858L]] > : :- InMemoryRelation > [root_referrer_id#1834858L], StorageLevel(disk, memory, deserialized, 1 > replicas) > : : +- Scan hive > sqoop_wongnai.wongnai__w_photo [root_referrer_id#82906L], HiveTableRelation > `sqoop_wongnai`.`wongnai__w_photo`, > org.apache.hadoop.hive.serde2.avro.AvroSerDe, [root_referrer_id#82906L] > : +- GenerateBloomFilter > : +- ReusedExchange [id#1834834L, > restaurant_id#1834836L], Exchange hashpartitioning(id#1834834L, 200) > +- *(6) Sort [restaurant_id#1834876L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(restaurant_id#1834876L, 200) > +- Filter ((isnotnull(restaurant_id#1834876L) && > (restaurant_id#1834876L = 360)) && bloomfilter#1835087 of [bf1835088 > id#1834834L estimatedNumRows=1457534] filtering [restaurant_id#1834876L]) > : +- GenerateBloomFilter > : +- ReusedExchange [id#1834834L, restaurant_id#1834836L], > Exchange hashpartitioning(id#1834834L, 200) > +- Scan hive tests.restaurants_id_export_photos > [restaurant_id#1834876L], HiveTableRelation > `tests`.`restaurants_id_export_photos`, > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [restaurant_id#1834876L] > {code} > > > > I noticed that in physical plan > > {code:java} > bloomfilter#1835087 of [bf1835088 id#1834834L estimatedNumRows=1457534] > filtering [restaurant_id#1834876L] > {code} > it use bf1835088 id#1834834L for filtering restaurant_id#1834876L and which > is different from join condition. > I think that the correct Physical query plan should use bf1835087 > restaurant_id#1834836L for filtering restaurant_id#1834876L. > > You can see full of query plan of both incorrect and correct in this link: > [https://gist.github.com/tanapoln/67a335858dede5eddd9d80659de60d72] > > I don't know why switching inner join statement yield the different result. > I don't know it is a bug or I misunderstand join in Spark. > Please help. > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org