GitHub user windpiger opened a pull request:

    https://github.com/apache/spark/pull/16255

    [SPARK-18609][SQL][WIP] Fix CTE with a join between two tables that have the same column name

    ## What changes were proposed in this pull request?
       After the optimizer rewrites a SQL query that uses a CTE and joins two tables with the same column name, an exception is thrown when the physical plan is executed.
    SQL:
    ```
    >spark.sql("CREATE TABLE p1 (col STRING)")
    >spark.sql("CREATE TABLE p2 (col STRING)")
    >spark.sql("set spark.sql.crossJoin.enabled = true")
    >spark.sql("WITH CTE AS (SELECT s2.col as col FROM p1 CROSS JOIN (SELECT e.col as col FROM p2 E) s2) SELECT T1.col as c1, T2.col as c2 FROM CTE T1 CROSS JOIN CTE T2")
    ```
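    Since both tables are newly created and hold no rows, the nested cross joins can produce no output. A plain-Python sketch of the query's shape (illustration only, not Spark code; table contents are hypothetical):

    ```python
    # Model each table as a list of rows; both start empty, like the
    # freshly created p1 and p2 above.
    from itertools import product

    p1 = []  # rows of p1 (empty)
    p2 = []  # rows of p2 (empty)

    # CTE: SELECT s2.col FROM p1 CROSS JOIN (SELECT col FROM p2) s2
    cte = [b for _, b in product(p1, p2)]

    # SELECT T1.col AS c1, T2.col AS c2 FROM CTE T1 CROSS JOIN CTE T2
    result = [(c1, c2) for c1, c2 in product(cte, cte)]
    print(result)  # [] -- the cross join of empty inputs is empty
    ```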
    
    Before the fix, the query's physical plan was:
    ```
    == Physical Plan ==
    *Project [col#123 AS c1#104, col#126 AS c2#105]
    +- CartesianProduct
       :- CartesianProduct
       :  :- HiveTableScan MetastoreRelation default, p1
       :  +- HiveTableScan [col#123], MetastoreRelation default, p2, E
       +- *!Project [col#123 AS col#126]
          +- CartesianProduct
             :- HiveTableScan MetastoreRelation default, p1
             +- HiveTableScan MetastoreRelation default, p2, E
    ```
    When the physical plan above is executed, it throws an exception because there is no attribute for col#123 to bind (the ColumnPruning/RemoveAliasOnlyProject rules are the root cause):
    ```
    org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: col#123
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
        at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
        at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:285)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
        at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:275)
        at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
        at org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:61)
        at org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:60)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    ```
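    The failure mode can be illustrated with a toy model of attribute-to-ordinal binding (a simplified sketch, not Spark's actual BindReferences implementation; the function and names here are hypothetical):

    ```python
    # Toy model: binding resolves an attribute id to its ordinal position
    # in the child operator's output schema. If the id is absent, binding
    # fails, analogous to the TreeNodeException above. Illustration only.

    def bind_reference(attr_id, child_output):
        """Return the ordinal of attr_id in child_output, or raise."""
        try:
            return child_output.index(attr_id)
        except ValueError:
            raise RuntimeError(f"Binding attribute, tree: {attr_id}")

    # In the broken plan, the inner Project computes `col#123 AS col#126`,
    # but after pruning its child no longer exposes col#123:
    try:
        bind_reference("col#123", [])  # child output is empty after pruning
    except RuntimeError as e:
        print(e)  # Binding attribute, tree: col#123
    ```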
    
    After the fix, the query's physical plan is:
    ```
    == Physical Plan ==
    *Project [col#12 AS c1#0, col#12 AS c2#1]
    +- CartesianProduct
       :- CartesianProduct
       :  :- HiveTableScan MetastoreRelation default, p1
       :  +- HiveTableScan [col#12], MetastoreRelation default, p2
       +- CartesianProduct
          :- HiveTableScan MetastoreRelation default, p1
          +- HiveTableScan [col#12], MetastoreRelation default, p2
    ```
    and the result is (empty, since both tables contain no rows):
    ```
    +---+---+
    | c1| c2|
    +---+---+
    +---+---+
    ```
    
    
    
    ## How was this patch tested?
    A unit test was added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/windpiger/spark CETwithJoinBetweenSameColumnTableExcep

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16255.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16255
    
