GitHub user windpiger opened a pull request: https://github.com/apache/spark/pull/16255
[SPARK-18609][SQL][WIP]Fix when CTE with Join between two table with same column ## What changes were proposed in this pull request? After optimize the SQL with CTE and Join between two tables with the same column name, there will throw a exception when the Physical Plan executed. SQL: ``` >spark.sql("CREATE TABLE p1 (col STRING)" >spark.sql("CREATE TABLE p2 (col STRING) " >spark.sql("set spark.sql.crossJoin.enabled = true") >spark.sql("WITH CTE AS (SELECT s2.col as col FROM p1 CROSS JOIN (SELECT e.col as col FROM p2 E) s2) SELECT T1.col as c1, T2.col as c2 FROM CTE T1 CROSS JOIN CTE T2") ``` before fixed the SQL explained: ``` == Physical Plan == *Project [col#123 AS c1#104, col#126 AS c2#105] +- CartesianProduct :- CartesianProduct : :- HiveTableScan MetastoreRelation default, p1 : +- HiveTableScan [col#123], MetastoreRelation default, p2, E +- *!Project [col#123 AS col#126] +- CartesianProduct :- HiveTableScan MetastoreRelation default, p1 +- HiveTableScan MetastoreRelation default, p2, E ``` when execute the Physical Plan above, it will throw exception because of there is no attribute for col#123 to bind( the rules ColumnPruning/RemoveAliasOnlyProject are the root cause): ` org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: col#123 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:285) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:275) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:61) at org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:60) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) ` after fixed the SQL explained: ``` == Physical Plan == *Project [col#12 AS c1#0, col#12 AS c2#1] +- CartesianProduct :- CartesianProduct : :- HiveTableScan MetastoreRelation default, p1 : +- HiveTableScan [col#12], MetastoreRelation default, p2 +- CartesianProduct :- HiveTableScan MetastoreRelation default, p1 +- HiveTableScan [col#12], MetastoreRelation default, p2 ``` and the result is ``` +---+---+ | c1| c2| +---+---+ +---+---+ ``` ## How was this patch tested? unit test added You can merge this pull request into a Git repository by running: $ git pull https://github.com/windpiger/spark CETwithJoinBetweenSameColumnTableExcep Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16255.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16255 ---- ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org