[jira] [Commented] (SPARK-10968) Incorrect Join behavior in filter conditions

RaviShankar KS (JIRA) Wed, 07 Oct 2015 01:25:26 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-10968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946493#comment-14946493
 ]


RaviShankar KS commented on SPARK-10968:
----------------------------------------

not actually incorrect.
DataFrame d5.value has fields col1 and col2, in that order
DataFrame d5_opp.value has fields col2 and col1, in that order

If I add a condition "d5.value === d5_opp.value" to the join, then it should 
replicate the exact case as in :
d5.value.col1 === d5_opp.value.col1 && d5.value.col2 === d5_opp.value.col2

The leaf fields have the same data types in both DataFrames.
I believe order should not matter here.


> Incorrect Join behavior in filter conditions
> --------------------------------------------
>
>                 Key: SPARK-10968
>                 URL: https://issues.apache.org/jira/browse/SPARK-10968
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.4.1, 1.5.1
>         Environment: RHEL, spark-shell
>            Reporter: RaviShankar KS
>              Labels: DataFramejoin, sql,
>         Attachments: CreateDF_sparkshell_jira.scala
>
>
> We notice that the join conditions are not working as expected in the case of 
> nested columns being compared.
> As long as leaf columns have the same name under a nested column, should 
> order matter ??
> Consider below example for two data frames d5 and d5_opp : 
> d5 and d5_opp have a nested field 'value', but their inner leaf columns do 
> not have the same ordering. 
> --       d5.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- col1: string (nullable = true)
>  |    |    |-- col2: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  |    |-- col1: string (nullable = false)
>  |    |-- col2: string (nullable = false)
> --        d5_opp.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- col2: string (nullable = true)
>  |    |    |-- col1: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  |    |-- col2: string (nullable = false)
>  |    |-- col1: string (nullable = false)
> The below join statement do not work in spark 1.5, and raises exception. In 
> spark 1.4, no exception is raised, but join result is incorrect :
> --    d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value"  === 
> $"d5_opp.value",  "inner").show
> Exception raised is :  
> org.apache.spark.sql.AnalysisException: cannot resolve '(value = value)' due 
> to data type mismatch: differing types in '(value = value)' 
> (array<struct<col1:string,col2:string>> and 
> array<struct<col2:string,col1:string>>).;
> --    d5.as("d5").join( d5_opp.as("d5_opp"),  $"d5.value1"  === 
> $"d5_opp.value1",  "inner").show
> Exception raised is :
> org.apache.spark.sql.AnalysisException: cannot resolve '(value1 = value1)' 
> due to data type mismatch: differing types in '(value1 = value1)' 
> (struct<col1:string,col2:string> and struct<col2:string,col1:string>).;
> // Code to be used in spark shell to create the data frames is attached.
> -------------------------
> The only work-around is to explode the conditions for every leaf field. 
> In our case, we are generating the conditions and dataframes 
> programmatically, and exploding the conditions for every leaf field is 
> additional overhead, and may not be always possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-10968) Incorrect Join behavior in filter conditions

Reply via email to