GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/20276
[SPARK-14948][SQL] disambiguate attributes in join condition ## What changes were proposed in this pull request? `Dataset.col/apply` returns a column reference, which is pretty useful to deal with duplicated names in join. e.g. ``` val df1 = ... // [a: int, b: int] val df2 = ...// [b: int, c: int] df1.join(df2, df1("b") === df2("b")) df1.join(df2).drop(df2("b")) ... ``` However, this is problematic for self-join, or joining DataFrames derived from the same DataFrame. The reason is that, the column reference returned by `Dataset.col` is actually `AttributeReference`, which means different DataFrames may return same column reference. After join, the right side would be de-duplicated if it has conflicting attributes with the left side, and the column reference returned by right side would be missing after join, or be wrong and refers to columns from the left side. To fix this issue entirely, we may need to define a real column reference that is globally uique, and design a dataframe lineage mechanism so that we can use column reference from another dataframe in a dataframe operation, e.g. ``` val df3 = df1.join(df2) df3.drop(df2("b")) ``` This is a lot of work and is too late for 2.3, here I propose a simple and safe solution to disambiguate attributes in join condition only, which is the most common problematic case. The idea is simple, we assign a globally unique id to each dataframe, via `AnalysisBarrier`. `Dataset.col` returns a special attribute that carries the id of dataframe it comes from. This special attribute is mostly a no-op and will be removed during resolution. It's only used when we are de-duplicating the join right side plan, these special attributes inside join condition would be replaced by the new attributes generated by the right side plan. ## How was this patch tested? new regression test You can merge this pull request into a Git repository by running: $ git pull https://github.com/cloud-fan/spark join-bug Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20276.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20276 ---- commit dd36ffb520b79c54e3efad9e79a88b3baf4fc985 Author: Wenchen Fan <wenchen@...> Date: 2018-01-16T12:26:41Z disambiguate attributes in join condition ---- --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org