[ https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433780#comment-15433780 ]
Herman van Hovell commented on SPARK-17120:
-------------------------------------------

TL;DR: the {{PushDownPredicate}} rule pushed the {{false}} join predicate down into the left-hand side of the join, when it should have gone into the right-hand side. This caused the {{EliminateOuterJoin}} rule to rewrite the outer join into an inner join; a sketch of why the push-down direction matters follows at the end of this message.

The optimized plan before disabling the {{PushDownPredicate}} rule (I had to disable the {{PruneFilters}} rule to prevent the plan from being erased):

{noformat}
Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
+- Join Inner
   :- Project [value#2 AS int_col_6#4]
   :  +- Filter false
   :     +- SerializeFromObject [input[0, int, true] AS value#2]
   :        +- ExternalRDD [obj#1]
   +- Project [value#10 AS int_col_1#12]
      +- SerializeFromObject [input[0, int, true] AS value#10]
         +- ExternalRDD [obj#9]
{noformat}

The optimized plan after disabling the {{PushDownPredicate}} rule:

{noformat}
== Optimized Logical Plan ==
Filter isnotnull(int_col#16)
+- Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
   +- Join LeftOuter, false
      :- Project [value#2 AS int_col_6#4]
      :  +- SerializeFromObject [input[0, int, true] AS value#2]
      :     +- ExternalRDD [obj#1]
      +- Project [value#10 AS int_col_1#12]
         +- SerializeFromObject [input[0, int, true] AS value#10]
            +- ExternalRDD [obj#9]
{noformat}

Btw, set {{spark.sql.crossJoin.enabled}} to {{true}} if you want to run this.

> Analyzer incorrectly optimizes plan to empty LocalRelation
> ----------------------------------------------------------
>
>                 Key: SPARK-17120
>                 URL: https://issues.apache.org/jira/browse/SPARK-17120
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.1.0
>            Reporter: Josh Rosen
>            Priority: Blocker
>
> Consider the following query:
> {code}
> sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
> sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
> println(sql("""
>     SELECT
>       *
>     FROM (
>       SELECT
>         COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
>       FROM table_3 t1
>       LEFT JOIN table_4 t2 ON false
>     ) t where (t.int_col) is not null
> """).collect().toSeq)
> {code}
> In the innermost query, the LEFT JOIN's condition is {{false}}, but the number of rows produced should nevertheless equal the number of rows in {{table_3}} (which is non-empty). Since no values are {{null}}, the outer {{where}} should retain all rows, so the overall result of this query should contain a single row with the value '97'.
> Instead, the current Spark master (as of 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking at {{explain}}, it appears that the logical plan is optimized to {{LocalRelation <empty>}}, so Spark doesn't even run the query. My suspicion is that there's a bug in constraint propagation or filter pushdown.
> This issue doesn't seem to affect Spark 2.0, so I think it's a regression in master.
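
To make the push-down direction above concrete, here is a minimal DataFrame-level sketch. It only illustrates the join semantics, not the rule's actual implementation, and it assumes a local {{SparkSession}}; the column names mirror the repro, everything else is illustrative:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Illustrative setup only; any local SparkSession will do.
val spark = SparkSession.builder().master("local[*]").appName("SPARK-17120").getOrCreate()
import spark.implicits._

// Needed because a constant join condition is treated as a cartesian product.
spark.conf.set("spark.sql.crossJoin.enabled", "true")

val left  = Seq(97).toDF("int_col_6")  // preserved side of the LEFT JOIN
val right = Seq(0).toDF("int_col_1")   // null-supplying side

// A LEFT JOIN whose condition is false must still keep every left row,
// padding the right-side columns with nulls: one row, (97, null).
left.join(right, lit(false), "left_outer").show()

// Pushing the false condition into the *right* (null-supplying) side is
// sound: the right input becomes empty, but left rows survive: (97, null).
left.join(right.filter(lit(false)), lit(true), "left_outer").show()

// Pushing it into the *left* (preserved) side -- the direction taken here --
// empties the preserved side, so the join produces no rows at all.
left.filter(lit(false)).join(right, lit(true), "left_outer").show()
{code}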
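
And for anyone who wants to reproduce the optimized plans shown above, a small sketch, assuming the two temp views from the repro in the description have been registered and the cross-join flag is set:

{code}
// Assumes table_3 and table_4 are registered as in the repro above and
// spark.sql.crossJoin.enabled is true.
val df = spark.sql("""
  SELECT *
  FROM (
    SELECT COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
    FROM table_3 t1
    LEFT JOIN table_4 t2 ON false
  ) t where (t.int_col) is not null
""")

df.explain(true)                          // parsed, analyzed, optimized and physical plans
println(df.queryExecution.optimizedPlan)  // just the optimized logical plan
{code}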