[ https://issues.apache.org/jira/browse/SPARK-10968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14946493#comment-14946493 ]
RaviShankar KS commented on SPARK-10968:
----------------------------------------

Not actually incorrect.

DataFrame d5.value has fields col1 and col2, in that order.
DataFrame d5_opp.value has fields col2 and col1, in that order.

If I add the condition "d5.value === d5_opp.value" to the join, then it should be equivalent to:
d5.value.col1 === d5_opp.value.col1 && d5.value.col2 === d5_opp.value.col2

The leaf fields have the same data types in both DataFrames. I believe order should not matter here.

> Incorrect Join behavior in filter conditions
> --------------------------------------------
>
>                 Key: SPARK-10968
>                 URL: https://issues.apache.org/jira/browse/SPARK-10968
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.4.1, 1.5.1
>         Environment: RHEL, spark-shell
>            Reporter: RaviShankar KS
>              Labels: DataFramejoin, sql,
>         Attachments: CreateDF_sparkshell_jira.scala
>
>
> We notice that join conditions do not work as expected when nested columns are compared.
> As long as the leaf columns under a nested column have the same names, should order matter?
> Consider the example below for two data frames, d5 and d5_opp:
> d5 and d5_opp both have a nested field 'value', but their inner leaf columns do not have the same ordering.
> -- d5.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- col1: string (nullable = true)
>  |    |    |-- col2: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  |    |-- col1: string (nullable = false)
>  |    |-- col2: string (nullable = false)
> -- d5_opp.printSchema
> root
>  |-- key: integer (nullable = false)
>  |-- value: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- col2: string (nullable = true)
>  |    |    |-- col1: string (nullable = true)
>  |-- value1: struct (nullable = false)
>  |    |-- col2: string (nullable = false)
>  |    |-- col1: string (nullable = false)
> The join statements below do not work in Spark 1.5 and raise an exception. In Spark 1.4, no exception is raised, but the join result is incorrect:
> -- d5.as("d5").join( d5_opp.as("d5_opp"), $"d5.value" === $"d5_opp.value", "inner").show
> Exception raised is:
> org.apache.spark.sql.AnalysisException: cannot resolve '(value = value)' due to data type mismatch: differing types in '(value = value)' (array<struct<col1:string,col2:string>> and array<struct<col2:string,col1:string>>).;
> -- d5.as("d5").join( d5_opp.as("d5_opp"), $"d5.value1" === $"d5_opp.value1", "inner").show
> Exception raised is:
> org.apache.spark.sql.AnalysisException: cannot resolve '(value1 = value1)' due to data type mismatch: differing types in '(value1 = value1)' (struct<col1:string,col2:string> and struct<col2:string,col1:string>).;
> // Code to be used in spark shell to create the data frames is attached.
> -------------------------
> The only work-around is to explode the conditions for every leaf field.
> In our case, we are generating the conditions and dataframes programmatically, and exploding the conditions for every leaf field is additional overhead and may not always be possible.
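The per-leaf workaround mentioned in the quoted report could be generated from the schema rather than written by hand. The following is only a minimal sketch for the single-level struct column `value1`; the object `StructJoin` and the helpers `structLeafNames` and `structEqualsByName` are hypothetical names introduced here, not part of Spark, and the array-of-struct column `value` is not handled.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical helpers, not part of Spark: they build the per-field
// join condition that the report describes writing out manually.
object StructJoin {

  // Names of the direct fields of a struct column, in schema order.
  def structLeafNames(schema: StructType, column: String): Seq[String] =
    schema(column).dataType match {
      case st: StructType => st.fieldNames.toSeq
      case other => sys.error(s"column '$column' is not a struct: $other")
    }

  // One equality test per field name, combined with &&, so the order
  // of the fields inside the struct no longer matters.
  def structEqualsByName(left: DataFrame, leftAlias: String,
                         right: DataFrame, rightAlias: String,
                         column: String): Column = {
    val leftNames = structLeafNames(left.schema, column)
    val rightNames = structLeafNames(right.schema, column)
    require(leftNames.nonEmpty && leftNames.sorted == rightNames.sorted,
      s"struct '$column' must have the same non-empty set of field names on both sides")
    leftNames
      .map(n => col(s"$leftAlias.$column.$n") === col(s"$rightAlias.$column.$n"))
      .reduce(_ && _)
  }
}

// Assumed usage in spark-shell, with d5 and d5_opp created as in the attachment:
// val cond = StructJoin.structEqualsByName(d5, "d5", d5_opp, "d5_opp", "value1")
// d5.as("d5").join(d5_opp.as("d5_opp"), cond, "inner").show
```

This keeps the condition generation programmatic, which is the situation described in the report, but it still expands to one comparison per leaf field, so it does not remove the overhead the reporter objects to.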