[ https://issues.apache.org/jira/browse/SPARK-13801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219137#comment-15219137 ]
Denton Cockburn commented on SPARK-13801:
-----------------------------------------

I'm not sure whether this is the same issue, but I ran into this problem.
{code}
import sqlContext.implicits._
import org.apache.spark.sql.functions._

val df = Seq((1, 1, 1, 1), (2, 2, 2, 2)).toDF("a", "b", "c", "d")
val f = df.where($"a" === 1).alias("a")
val s = df.where($"a" === 2).alias("b")
f.join(s, f("b") === s("b") and f("c") === s("c"), "outer")
  .select(coalesce(f("b"), s("b")), coalesce(f("c"), s("c")), coalesce(f("d"), s("d")))
  .show
{code}
The output is:
{code}
|coalesce(b,b)|coalesce(c,c)|coalesce(d,d)|
+-------------+-------------+-------------+
|            1|            1|            1|
|         null|         null|         null|
+-------------+-------------+-------------+
{code}
Instead of:
{code}
|coalesce(b,b)|coalesce(c,c)|coalesce(d,d)|
+-------------+-------------+-------------+
|            1|            1|            1|
|            2|            2|            2|
+-------------+-------------+-------------+
{code}

> DataFrame.col should return unresolved attribute
> ------------------------------------------------
>
>                 Key: SPARK-13801
>                 URL: https://issues.apache.org/jira/browse/SPARK-13801
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Wenchen Fan
>
> Recently I saw some JIRAs complaining about wrong results when using the
> DataFrame API. After checking their queries, I found the cause was an
> indirect self-join with a wrong join condition. For example:
> {code}
> val df = ...
> val df2 = df.filter(...)
> df.join(df2, (df("key") + 1) === df2("key"))
> {code}
> In this case, the confusing part is that df("key") and df2("key") refer to
> the same column, while df and df2 are different DataFrames.
> I think the biggest problem is that we give users the resolved attribute.
> However, a resolved attribute is not a real column, as the logical plan's
> output may change. For example, we generate new output for the right child
> in a self-join.
> My proposal is: `DataFrame.col` should always return an unresolved attribute.
> We can still do the resolution to make sure the given column name is
> resolvable, but we don't return the resolved attribute; we just take the
> name and wrap it in UnresolvedAttribute.
> Now if users run the example query I gave at the beginning, they will get
> an analysis exception, and they will understand that they need to alias df
> and df2 before the join.
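
For illustration, the aliasing that the description asks for can be sketched against the reproduction from the comment. This is only a sketch under assumptions: it assumes a spark-shell session where `sqlContext` is already in scope (as the comment's snippet does), and the alias names "f" and "s" and the output column names are chosen here for readability, not taken from the issue. The point is that columns are referenced by qualified name (`$"f.b"`) rather than through `f("b")` and `s("b")`, which both resolve to the same underlying attribute.

```scala
// Assumes spark-shell, where `sqlContext` is predefined (as in the comment above).
import sqlContext.implicits._
import org.apache.spark.sql.functions._

val df = Seq((1, 1, 1, 1), (2, 2, 2, 2)).toDF("a", "b", "c", "d")

// Give each side of the self-join its own alias ("f" and "s" are illustrative names).
val f = df.where($"a" === 1).as("f")
val s = df.where($"a" === 2).as("s")

// Refer to columns by qualified name so each reference is resolved
// against the intended side of the join during analysis.
val result = f.join(s, $"f.b" === $"s.b" && $"f.c" === $"s.c", "outer")
  .select(
    coalesce($"f.b", $"s.b").as("b"),
    coalesce($"f.c", $"s.c").as("c"),
    coalesce($"f.d", $"s.d").as("d"))

result.show()
```

Because the qualified names are plain strings, they stay unresolved until the join is analyzed, which is close in spirit to what the proposal wants `DataFrame.col` to return.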