[ 
https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17154:
------------------------------------

    Assignee:     (was: Apache Spark)

> Wrong result can be returned or AnalysisException can be thrown after 
> self-join or similar operations
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17154
>                 URL: https://issues.apache.org/jira/browse/SPARK-17154
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Kousuke Saruta
>
> When we join two DataFrames that originate from the same DataFrame,
> operations on the joined DataFrame can fail.
> One reproducible example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
> val filtered = df.filter("col1 != 3").select("col1", "col2")
> val joined = filtered.join(df, filtered("col1") === df("col1"), "inner")
> val selected1 = joined.select(df("col3"))
> {code}
> In this case, an AnalysisException is thrown.
> Another example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
> val filtered = df.filter("col1 != 3").select("col1", "col2")
> val rightOuterJoined =
>   filtered.join(df, filtered("col1") === df("col1"), "right")
> val selected2 = rightOuterJoined.select(df("col1"))
> selected2.show
> {code}
> In this case, we expect the following result.
> {code}
> 1
> 2
> 3
> 4
> 5
> {code}
> But the actual result is as follows.
> {code}
> 1
> 2
> null
> 4
> 5
> {code}
> Both problems have the same cause: when the two sides of a join contain
> expressions that share an exprId, the analyzer (in the ResolveReferences
> rule) re-creates the logical plan of the right-side DataFrame together with
> the expressions in its output.
> The re-created expressions are equal to the original ones except for their
> exprIds.
> This happens when we do a self-join or a similar pattern of operations.
> In the first example, df("col3") returns a Column that wraps an expression,
> and that expression has an exprId (say id1).
> After the join, the expressions of the right-side DataFrame (df) are
> re-created; the old and new expressions are otherwise equal, but the exprId
> is renewed (say id2).
> Because id1 no longer matches any expression in the joined DataFrame, an
> AnalysisException is thrown.
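
One way to observe the mismatch described for the first example is to compare the exprId carried by df("col3") with the exprIds in the analyzed plan of the joined DataFrame. The snippet below is only a sketch: it assumes a Spark 2.0 spark-shell (where `spark` is predefined), it reads Catalyst internals (`Column.expr`, `queryExecution.analyzed`) that are not a stable public API, and the names `captured` and `available` are made up for illustration. The printed ids vary between runs, so no output is shown.

```scala
// Assumes a Spark 2.0 spark-shell, where `spark` is already defined.
import spark.implicits._
import org.apache.spark.sql.catalyst.expressions.NamedExpression

val df = Seq(
  (1, "a", "A"),
  (2, "b", "B"),
  (3, "c", "C")).toDF("col1", "col2", "col3")
val filtered = df.filter("col1 != 3").select("col1", "col2")
val joined = filtered.join(df, filtered("col1") === df("col1"), "inner")

// exprId carried by the Column taken from the original DataFrame ("id1").
val captured = df("col3").expr.asInstanceOf[NamedExpression].exprId

// Names and exprIds of the attributes the analyzed join plan outputs.
val available = joined.queryExecution.analyzed.output.map(a => a.name -> a.exprId)

println(s"captured by df(col3): $captured")
available.foreach { case (name, id) => println(s"joined output: $name -> $id") }
```

If the description above holds, the id printed for df("col3") should not appear among the ids of the joined output.
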
> In the second example, df("col1") returns a Column, and the expression it
> contains is assigned an exprId (say id3).
> The Column returned by filtered("col1") contains an expression with the
> same exprId (id3).
> After the join, the expressions on the right side are re-created, so the
> expression with id3 is no longer present on the right side but is still
> present on the left side.
> So, when df("col1") is resolved against the joined DataFrame, we get col1
> of the left side, which includes null.
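
The two failure modes can be reproduced in a toy model that resolves references purely by exprId. This is a sketch of the reasoning in the description, not Spark's implementation: `Attr`, `recreate`, `resolve`, and the numeric ids are all hypothetical names and values chosen for illustration.

```scala
// Toy model of resolution by exprId; these types are hypothetical,
// not Spark's own classes.
object ExprIdModel {
  final case class Attr(name: String, exprId: Int, side: String)

  // Re-create right-side attributes with fresh exprIds, mimicking what the
  // issue describes for a join whose two sides share exprIds.
  def recreate(attrs: Seq[Attr], firstFreshId: Int): Seq[Attr] =
    attrs.zipWithIndex.map { case (a, i) => a.copy(exprId = firstFreshId + i) }

  // A reference captured before the join is looked up by exprId only.
  def resolve(output: Seq[Attr], exprId: Int): Option[Attr] =
    output.find(_.exprId == exprId)

  def main(args: Array[String]): Unit = {
    // df has col1, col2, col3 with exprIds 1, 2, 3.
    val dfOut = Seq(
      Attr("col1", 1, "right"), Attr("col2", 2, "right"), Attr("col3", 3, "right"))
    // filtered = df.filter(...).select("col1", "col2") keeps df's exprIds.
    val filteredOut = dfOut.take(2).map(_.copy(side = "left"))
    // Output of the join: left side as is, right side re-created.
    val joinedOut = filteredOut ++ recreate(dfOut, firstFreshId = 10)

    // df("col3") still carries exprId 3, which no longer exists anywhere.
    println(resolve(joinedOut, 3))              // None
    // df("col1") still carries exprId 1, which now matches only the left side.
    println(resolve(joinedOut, 1).map(_.side))  // Some(left)
  }
}
```

The `None` corresponds to the AnalysisException in the first example, and `Some(left)` corresponds to the second example silently returning the left side's col1.
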



