[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15186660#comment-15186660
 ] 

Adrian Wang edited comment on SPARK-13393 at 3/9/16 7:31 AM:
-------------------------------------------------------------

How do you resolve it? Both sides are `df`, so we can resolve df("key") to a 
single side, which leads to a Cartesian join (4 output rows), or we can resolve 
it to both sides (2 output rows). We cannot tell which one the user meant.
The current design does not throw an exception because it assumes that identical 
columns in a join condition come from different sides, as I stated above. I 
don't think that is a sound approach.
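
To make the two readings concrete, here is a minimal sketch in plain Scala 
collections (no Spark involved). The `Row` class, the two-row `df` and the 
object name are assumptions made for illustration only; they model a 
self-join of a two-row table and are not Spark's resolution logic.

```scala
object SelfJoinAmbiguity {
  // Hypothetical stand-in for a two-row DataFrame with a single "key" column.
  final case class Row(key: Int)
  val df: Seq[Row] = Seq(Row(1), Row(2))

  // Reading 1: both column references bind to the same (left) side, so the
  // condition is trivially true and the join degenerates to a Cartesian product.
  val singleSide: Seq[(Row, Row)] =
    for { l <- df; r <- df if l.key == l.key } yield (l, r)

  // Reading 2: the references bind to different sides, giving an equi-join.
  val bothSides: Seq[(Row, Row)] =
    for { l <- df; r <- df if l.key == r.key } yield (l, r)

  def main(args: Array[String]): Unit = {
    println(singleSide.size) // 4
    println(bothSides.size)  // 2
  }
}
```

Both readings are valid interpretations of the same expression, which is why 
silently picking one of them is questionable.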


> Column mismatch issue in left_outer join using Spark DataFrame
> --------------------------------------------------------------
>
>                 Key: SPARK-13393
>                 URL: https://issues.apache.org/jira/browse/SPARK-13393
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe is wrong after the left join and shows the following:
> | id|varadha_id|
> |  1|         1|
> |  2|         2| (*This should have been null*)
> whereas `correctDF` has the correct output after the left join:
> | id|nagaraj_id|
> |  1|      null|
> |  2|         2|
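
For reference, the result the reporter expects from `errorDF` can be sketched 
with plain Scala collections. This only models left outer join semantics on 
the two `Person` rows from the snippet; the object name and the use of 
`Option` to stand in for SQL null are assumptions for illustration, not Spark 
code.

```scala
object ExpectedLeftOuter {
  final case class Person(id: Int, name: String)

  val df: Seq[Person] = Seq(Person(1, "varadha"), Person(2, "nagaraj"))
  val varadha: Seq[Person] = df.filter(_.id == 1)

  // Left outer join on id: every left row is kept, and a left row without
  // a match on the right is paired with None (SQL null).
  val expected: Seq[(Int, Option[Int])] = df.flatMap { l =>
    val matches = varadha.filter(_.id == l.id)
    if (matches.isEmpty) Seq((l.id, Option.empty[Int]))
    else matches.map(r => (l.id, Option(r.id)))
  }

  def main(args: Array[String]): Unit =
    expected.foreach(println) // (1,Some(1)) then (2,None)
}
```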


