Wenchen Fan created SPARK-13801:
-----------------------------------

             Summary: DataFrame.col should return unresolved attribute
                 Key: SPARK-13801
                 URL: https://issues.apache.org/jira/browse/SPARK-13801
             Project: Spark
          Issue Type: Improvement
            Reporter: Wenchen Fan


Recently I saw several JIRAs complaining about wrong results when using the 
DataFrame API. After checking their queries, I found the cause was an indirect 
self-join for which they built the wrong join condition. For example:

{code}
val df = ...
val df2 = df.filter(...)
df.join(df2, df("key") === df2("key"))
{code}

In this case, the confusing part is that df("key") and df2("key") refer to the 
same column, even though df and df2 are different DataFrames.

I think the biggest problem is that we give users a resolved attribute. However, 
a resolved attribute is not a real column, because a logical plan's output may 
change. For example, we generate new output for the right child in a self-join.
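
To make the collision concrete, here is a minimal toy model in plain Scala. It 
is only a sketch: Attr and Frame are hypothetical stand-ins invented for this 
illustration, not Spark's Catalyst classes, and the assumption is simply that a 
resolved attribute is identified by an id that a filter leaves untouched.

```scala
// Toy model only: Attr and Frame are hypothetical, not Spark classes.
object ResolvedAttributeDemo {
  // A resolved attribute is identified by a unique id, not just by its name.
  final case class Attr(name: String, id: Long)

  // A "DataFrame" here is reduced to the attributes its plan outputs.
  final case class Frame(output: Seq[Attr]) {
    // Mirrors the behaviour described above: hand back the resolved attribute.
    def apply(name: String): Attr =
      output.find(_.name == name).getOrElse(sys.error(s"cannot resolve '$name'"))

    // A filter does not change the output, so the attribute ids are kept.
    def filter(): Frame = Frame(output)
  }

  val df: Frame  = Frame(Seq(Attr("key", 1L), Attr("value", 2L)))
  val df2: Frame = df.filter()

  // Both sides of the intended join condition are the very same attribute.
  val left: Attr  = df("key")
  val right: Attr = df2("key")
  val sameAttribute: Boolean = left == right

  def main(args: Array[String]): Unit =
    println(s"left=$left right=$right sameAttribute=$sameAttribute")
}
```

In this model the condition df("key") === df2("key") compares an attribute with 
itself, so it cannot distinguish the two sides of the join.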

My proposal is: `DataFrame.apply` should always return an unresolved attribute. 
We can still do the resolution to make sure the given column name is resolvable, 
but we don't return the resolved attribute; we just take the name and wrap it in 
UnresolvedAttribute.
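
The proposed behaviour can be sketched with another toy model in plain Scala. 
All names here (Attr, Unresolved, Frame, AnalysisError, resolveInJoin) are 
hypothetical and exist only for this illustration; the real change would live in 
DataFrame.apply and the analyzer, and may differ in detail.

```scala
import scala.util.Try

// Toy model only: none of these types are Spark classes.
object UnresolvedAttributeDemo {
  final case class Attr(name: String, id: Long)
  final case class Unresolved(name: String)
  final class AnalysisError(msg: String) extends RuntimeException(msg)

  final case class Frame(output: Seq[Attr]) {
    // Proposed behaviour: validate that the name resolves, then return only
    // the name wrapped as an unresolved attribute.
    def apply(name: String): Unresolved = {
      if (!output.exists(_.name == name))
        throw new AnalysisError(s"cannot resolve '$name'")
      Unresolved(name)
    }

    // A filter keeps the same output attributes.
    def filter(): Frame = Frame(output)
  }

  // Resolve a name against both join inputs; more than one match is an error.
  def resolveInJoin(col: Unresolved, left: Frame, right: Frame): Attr =
    (left.output ++ right.output).filter(_.name == col.name) match {
      case Seq(single) => single
      case Seq()       => throw new AnalysisError(s"cannot resolve '${col.name}'")
      case _           => throw new AnalysisError(s"reference '${col.name}' is ambiguous")
    }

  val df: Frame  = Frame(Seq(Attr("key", 1L), Attr("value", 2L)))
  val df2: Frame = df.filter()

  // The self-join condition from the example now fails during resolution.
  val selfJoinFails: Boolean = Try(resolveInJoin(df("key"), df, df2)).isFailure

  def main(args: Array[String]): Unit =
    println(s"selfJoinFails=$selfJoinFails")
}
```

The point of the design is that the name is checked early, but binding to a 
concrete attribute is deferred until the whole plan is known.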

Now if users run the example query I gave at the beginning, they will get an 
analysis exception, and they will understand that they need to alias df and df2 
before the join.
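
For reference, aliasing both sides might look like the sketch below. It assumes 
a Spark 2.x or later build where SparkSession is available and Spark is on the 
classpath; the sample data, the alias names a and b, and the filter predicate 
are made up for illustration and are not part of the original report.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object AliasedSelfJoin {
  // Join two DataFrames derived from the same source by aliasing each side
  // and referring to the key column through the alias.
  def aliasedSelfJoin(left: DataFrame, right: DataFrame): DataFrame =
    left.as("a").join(right.as("b"), col("a.key") === col("b.key"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("aliased-self-join")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the elided df in the example.
    val df  = Seq((1, "x"), (2, "y"), (3, "z")).toDF("key", "value")
    val df2 = df.filter(col("key") > 1)

    aliasedSelfJoin(df, df2).show()
    spark.stop()
  }
}
```

Because each column is addressed through its alias, the condition names one 
side of the join explicitly instead of relying on which DataFrame object the 
column was taken from.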


