[jira] [Commented] (SPARK-13801) DataFrame.col should return unresolved attribute

Denton Cockburn (JIRA) Wed, 30 Mar 2016 17:40:44 -0700

    [ https://issues.apache.org/jira/browse/SPARK-13801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15219137#comment-15219137 ]


Denton Cockburn commented on SPARK-13801:
-----------------------------------------

I'm not sure whether this is the same issue, but I ran into the following problem.

{code}
import sqlContext.implicits._
import org.apache.spark.sql.functions._

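val df = Seq((1, 1, 1, 1), (2, 2, 2, 2)).toDF("a", "b", "c", "d")
// Both sides are filtered views of the same df, so the join below is a self-join
val f = df.where($"a" === 1).alias("a")
val s = df.where($"a" === 2).alias("b")

// Full outer join on b and c, then take the first non-null value from either side
f.join(s, f("b") === s("b") and f("c") === s("c"), "outer")
  .select(coalesce(f("b"), s("b")), coalesce(f("c"), s("c")), coalesce(f("d"), s("d")))
  .show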
{code}

The output is:
{code}
+-------------+-------------+-------------+
|coalesce(b,b)|coalesce(c,c)|coalesce(d,d)|
+-------------+-------------+-------------+
|            1|            1|            1|
|         null|         null|         null|
+-------------+-------------+-------------+
{code}

Instead of:
{code}
+-------------+-------------+-------------+
|coalesce(b,b)|coalesce(c,c)|coalesce(d,d)|
+-------------+-------------+-------------+
|            1|            1|            1|
|            2|            2|            2|
+-------------+-------------+-------------+
{code}
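
A possible workaround, sketched under assumptions rather than verified against this exact Spark build: alias each side and refer to the columns by alias-qualified name instead of through f("...") and s("..."), so that each reference is resolved against its own side of the join. The alias names "l" and "r" are hypothetical; they are chosen here so they do not collide with the column names "a" and "b". The snippet assumes a Spark shell where sqlContext is already in scope.
{code}
import sqlContext.implicits._
import org.apache.spark.sql.functions._

val df = Seq((1, 1, 1, 1), (2, 2, 2, 2)).toDF("a", "b", "c", "d")
// "l" and "r" are hypothetical alias names picked for this sketch
val f = df.where($"a" === 1).alias("l")
val s = df.where($"a" === 2).alias("r")

// Alias-qualified names are resolved by the analyzer per side of the join,
// rather than being bound up front to the attributes of the shared df
f.join(s, $"l.b" === $"r.b" && $"l.c" === $"r.c", "outer")
  .select(
    coalesce($"l.b", $"r.b"),
    coalesce($"l.c", $"r.c"),
    coalesce($"l.d", $"r.d"))
  .show()
{code}
If the aliases resolve as expected, the second row should contain 2, 2, 2 rather than nulls.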

> DataFrame.col should return unresolved attribute
> ------------------------------------------------
>
>                 Key: SPARK-13801
>                 URL: https://issues.apache.org/jira/browse/SPARK-13801
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Wenchen Fan
>
> Recently I saw some JIRAs complaining about wrong results when using the
> DataFrame API. After checking their queries, I found that the cause was an
> indirect self-join for which the wrong join condition had been built. For
> example:
> {code}
> val df = ...
> val df2 = df.filter(...)
> df.join(df2, (df("key") + 1) === df2("key"))
> {code}
> In this case, the confusing part is that df("key") and df2("key") refer to
> the same column, even though df and df2 are different DataFrames.
> I think the biggest problem is that we give users the resolved attribute.
> However, a resolved attribute is not a real column, because a logical plan's
> output may change. For example, we generate new output for the right child
> in a self-join.
> My proposal is that `DataFrame.col` should always return an unresolved
> attribute. We can still run the resolution to make sure the given column
> name is resolvable, but instead of returning the resolved attribute, we just
> take the name and wrap it in UnresolvedAttribute.
> With this change, users who run the example query given at the beginning
> will get an analysis exception, and they will understand that they need to
> alias df and df2 before the join.
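
For illustration only, the aliasing that the description asks users to do could look roughly like the sketch below. The sample data, the filter predicate, and the alias names "l" and "r" are assumptions, since the original example elides them; sqlContext is assumed to be in scope, as in a Spark shell.
{code}
import sqlContext.implicits._

// Hypothetical data; the original example does not show how df is built
val df = Seq((1, "x"), (2, "y"), (3, "z")).toDF("key", "value")
val df2 = df.filter($"key" > 1)

// Alias both sides first, then build the join condition from
// alias-qualified names instead of df("key") and df2("key")
val joined = df.as("l").join(df2.as("r"), ($"l.key" + 1) === $"r.key")
joined.show()
{code}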



