[ https://issues.apache.org/jira/browse/SPARK-42503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zzzzming95 updated SPARK-42503:
-------------------------------
Description:

In Spark SQL, a join condition is allowed to reference fields that come from neither the left nor the right table of the join. In that case, the join degenerates into a cross join.

Suppose you have two tables, test1 and test2, with the same schema:

{code:sql}
CREATE TABLE `default`.`test1` (
  `id` INT,
  `name` STRING,
  `age` INT,
  `dt` STRING)
USING parquet
PARTITIONED BY (dt)
{code}

The following SQL has three joins, but the condition of the last left join is `t1.name = t2.name`, and t3.name is not referenced at all. As a result, the last left join becomes a cross join.

{code:sql}
select *
from
  (select * from test1 where dt = "20230215" and age = 1) t1
left join
  (select * from test1 where dt = "20230215" and age = 2) t2
  on t1.name = t2.name
left join
  (select * from test2 where dt = "20230215") t3
  on t1.name = t2.name;
{code}

So I think Spark SQL should do further validation on the join condition: every field in a join condition must come from the left or the right table, otherwise an `AnalysisException` should be thrown.


> Spark SQL should do further validation on join condition fields
> ---------------------------------------------------------------
>
>                 Key: SPARK-42503
>                 URL: https://issues.apache.org/jira/browse/SPARK-42503
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.2
>            Reporter: zzzzming95
>            Priority: Major
>             Fix For: 3.4.0
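The proposed validation can be sketched outside of Spark. This is a hypothetical illustration, not Catalyst code: model each side of a join as the set of columns it outputs, and flag a condition that references no column from one side, since that side then contributes nothing to the match and the join behaves like a cross join over it.

```python
# Hypothetical sketch of the proposed check (not Spark's actual
# implementation): a join condition should reference at least one
# column from each side of the join.

def join_condition_is_suspicious(left_cols, right_cols, condition_cols):
    """Return True if the condition references no column from one side,
    which makes the join degenerate into a cross join on that side."""
    refs_left = bool(condition_cols & left_cols)
    refs_right = bool(condition_cols & right_cols)
    return not (refs_left and refs_right)

# The case from this report: the outer left join combines the output of
# (t1 join t2) with t3, but its condition is t1.name = t2.name, so t3
# contributes no column to the condition.
left_side = {"t1.name", "t1.age", "t2.name", "t2.age"}  # output of t1 join t2
right_side = {"t3.name", "t3.age"}                      # output of t3
condition = {"t1.name", "t2.name"}

print(join_condition_is_suspicious(left_side, right_side, condition))  # True
```

Under this rule, the analyzer could raise an `AnalysisException` (or at least a warning) when the check returns True, while a corrected condition such as `t1.name = t3.name` would pass.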
--
This message was sent by Atlassian Jira
(v8.20.10#820010)