GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/20276

    [SPARK-14948][SQL] disambiguate attributes in join condition

    ## What changes were proposed in this pull request?
    
    `Dataset.col`/`Dataset.apply` returns a column reference, which is useful
    for dealing with duplicated column names in a join, e.g.
    ```
    val df1 = ... // [a: int, b: int]
    val df2 = ... // [b: int, c: int]
    
    df1.join(df2, df1("b") === df2("b"))
    df1.join(df2).drop(df2("b"))
    ...
    ```
    
    However, this is problematic for self-joins, or for joining DataFrames
    derived from the same DataFrame. The reason is that the column reference
    returned by `Dataset.col` is actually an `AttributeReference`, so different
    DataFrames may return the same column reference. During the join, the right
    side is de-duplicated if its attributes conflict with the left side. A
    column reference obtained from the right side then either goes missing
    after the join, or wrongly refers to a column from the left side.
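
    As a toy illustration of this failure mode, here is a minimal sketch in
    plain Scala. It is not Spark code: `Attr`, `Frame`, `derive` and
    `dedupRight` are hypothetical stand-ins for `AttributeReference`, a
    DataFrame's output, a derived DataFrame, and the right-side de-duplication,
    and the model is deliberately simplified.

    ```scala
    // Toy model only; none of these types are Spark's real classes.
    object AmbiguousRefSketch {
      // An attribute is identified by its expression id, not by its name.
      final case class Attr(name: String, exprId: Long)

      final case class Frame(output: Seq[Attr]) {
        // Column lookup returns the underlying attribute itself.
        def col(name: String): Attr = output.find(_.name == name).get
        // A derived frame (think: a filter) keeps the same output attributes.
        def derive: Frame = Frame(output)
      }

      private var nextId = 100L
      private def freshId(): Long = { nextId += 1; nextId }

      // Right-side attributes that conflict with the left side get fresh ids.
      def dedupRight(left: Frame, right: Frame): Frame = {
        val leftIds = left.output.map(_.exprId).toSet
        Frame(right.output.map { a =>
          if (leftIds(a.exprId)) a.copy(exprId = freshId()) else a
        })
      }

      val df1 = Frame(Seq(Attr("a", 1L), Attr("b", 2L)))
      val df2 = df1.derive

      // Both frames hand out the very same reference for "b".
      val leftRef = df1.col("b")
      val rightRef = df2.col("b")

      // After de-duplication the old right-side reference no longer exists
      // on the right, but it still matches an attribute on the left.
      val newRight = dedupRight(df1, df2)
      val rightRefResolvesOnRight = newRight.output.contains(rightRef)
      val rightRefResolvesOnLeft = df1.output.contains(rightRef)
    }
    ```

    In this model a condition built from `df1.col("b")` and `df2.col("b")`
    compares an attribute with itself, which is the ambiguity described above.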
    
    To fix this issue entirely, we may need to define a real column reference
    that is globally unique, and design a DataFrame lineage mechanism so that
    a column reference from one DataFrame can be used in an operation on
    another DataFrame, e.g.
    ```
    val df3 = df1.join(df2)
    df3.drop(df2("b"))
    ```
    
    This is a lot of work and it is too late for 2.3. Here I propose a simple
    and safe solution that disambiguates attributes in the join condition only,
    which is the most common problematic case.
    
    The idea is simple: we assign a globally unique id to each DataFrame via
    `AnalysisBarrier`. `Dataset.col` returns a special attribute that carries
    the id of the DataFrame it comes from. This special attribute is mostly a
    no-op and is removed during resolution. It is only used when we
    de-duplicate the right side plan of a join: the special attributes inside
    the join condition are replaced by the new attributes generated by the
    right side plan.
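
    The idea can be sketched in plain Scala as follows. This is a toy model
    under stated assumptions, not the code in this patch: `TaggedAttr`,
    `Frame.id`, `EqualTo`, `ResolvedEqualTo` and `resolveJoin` are hypothetical
    names, `Frame.id` stands in for the id carried via `AnalysisBarrier`, and
    the sketch assumes that only references tagged with the right side's id
    are rewritten.

    ```scala
    // Toy model only; none of these types are Spark's real classes.
    object TaggedRefSketch {
      final case class Attr(name: String, exprId: Long)

      // Stand-in for the special attribute: it remembers its source frame.
      final case class TaggedAttr(attr: Attr, frameId: Long)

      final case class Frame(id: Long, output: Seq[Attr]) {
        def col(name: String): TaggedAttr =
          TaggedAttr(output.find(_.name == name).get, id)
      }

      final case class EqualTo(left: TaggedAttr, right: TaggedAttr)
      final case class ResolvedEqualTo(left: Attr, right: Attr)

      private var nextExprId = 100L
      private def freshExprId(): Long = { nextExprId += 1; nextExprId }

      def resolveJoin(left: Frame, right: Frame, cond: EqualTo): ResolvedEqualTo = {
        val leftIds = left.output.map(_.exprId).toSet
        // Old-to-new mapping for right-side attributes that conflict.
        val rewrites: Map[Attr, Attr] = right.output.collect {
          case a if leftIds(a.exprId) => a -> a.copy(exprId = freshExprId())
        }.toMap
        // Drop the tag; rewrite only references that came from the right frame.
        def strip(t: TaggedAttr): Attr =
          if (t.frameId == right.id) rewrites.getOrElse(t.attr, t.attr) else t.attr
        ResolvedEqualTo(strip(cond.left), strip(cond.right))
      }

      val df1 = Frame(id = 1L, output = Seq(Attr("a", 1L), Attr("b", 2L)))
      // Derived from df1: same attributes, different frame id.
      val df2 = Frame(id = 2L, output = df1.output)

      val resolved = resolveJoin(df1, df2, EqualTo(df1.col("b"), df2.col("b")))
    }
    ```

    Here `resolved.left` keeps the original left-side attribute, while
    `resolved.right` points at the freshly generated right-side attribute, so
    the condition no longer compares an attribute with itself.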
    
    ## How was this patch tested?
    
    Added a new regression test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark join-bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20276.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20276
    
----
commit dd36ffb520b79c54e3efad9e79a88b3baf4fc985
Author: Wenchen Fan <wenchen@...>
Date:   2018-01-16T12:26:41Z

    disambiguate attributes in join condition

----

