[jira] [Comment Edited] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation

2016-08-23 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433780#comment-15433780
 ] 

Herman van Hovell edited comment on SPARK-17120 at 8/23/16 11:01 PM:
-

TL;DR the {{EliminateOuterJoin}} rule converts the outer join into an Inner 
join:
{noformat}
16/08/24 00:55:46 TRACE SparkOptimizer: 
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin ===
Before:
Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
+- Filter isnotnull(coalesce(int_col_1#12, int_col_6#4))
   +- Join LeftOuter, false
      :- Project [value#2 AS int_col_6#4]
      :  +- SerializeFromObject [input[0, int, true] AS value#2]
      :     +- ExternalRDD [obj#1]
      +- Project [value#10 AS int_col_1#12]
         +- SerializeFromObject [input[0, int, true] AS value#10]
            +- ExternalRDD [obj#9]

After:
Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
+- Filter isnotnull(coalesce(int_col_1#12, int_col_6#4))
   +- Join Inner, false
      :- Project [value#2 AS int_col_6#4]
      :  +- SerializeFromObject [input[0, int, true] AS value#2]
      :     +- ExternalRDD [obj#1]
      +- Project [value#10 AS int_col_1#12]
         +- SerializeFromObject [input[0, int, true] AS value#10]
            +- ExternalRDD [obj#9]
{noformat}
It incorrectly assumes that a value that cannot be null could be, well... null, and then converts the join.
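
To make the flawed reasoning concrete, here is a small, self-contained Scala sketch (a simplified, hypothetical model, not the actual Spark code) of what a null-rejection check concludes if it substitutes null for every attribute the filter references, rather than only for the attributes coming from the null-supplying side:
{code}
// Hypothetical, simplified model of the null-rejection reasoning described above.
// This is NOT the Spark source; names and structure are illustrative only.
object NullRejectionSketch {
  // Nullable int columns modeled as Option[Int].
  def coalesce(values: Option[Int]*): Option[Int] = values.find(_.isDefined).flatten
  def isNotNull(v: Option[Int]): Boolean = v.isDefined

  // The filter sitting above the join: isnotnull(coalesce(int_col_1, int_col_6)).
  def filterPasses(intCol1: Option[Int], intCol6: Option[Int]): Boolean =
    isNotNull(coalesce(intCol1, intCol6))

  def main(args: Array[String]): Unit = {
    // Nulling out *every* referenced attribute makes the filter look
    // null-rejecting, so converting LeftOuter to Inner seems safe ...
    println(filterPasses(None, None))     // false: "the filter drops null rows"

    // ... but a real null-extended row only nulls the right side (int_col_1);
    // the left-hand value 97 keeps the filter satisfied, so the row survives.
    println(filterPasses(None, Some(97))) // true: the conversion is not safe
  }
}
{code}
If that is what the rule effectively evaluates, the LeftOuter-to-Inner conversion (and the later collapse to an empty LocalRelation) follows directly.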

BTW: set {{spark.sql.crossJoin.enabled}} to {{true}} if you want to run this. 
Also use {{sc.setLogLevel("TRACE")}} to see what the optimizer is doing.
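For reference, the two settings as they would be applied in a Spark 2.x {{spark-shell}} session (where {{spark}} and {{sc}} are already defined):
{code}
// Needed to run the repro query at all (see the note above).
spark.conf.set("spark.sql.crossJoin.enabled", "true")

// TRACE logging makes the optimizer log every rule it applies,
// including the EliminateOuterJoin step shown earlier.
sc.setLogLevel("TRACE")
{code}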


was (Author: hvanhovell):
TL;DR the {{PushDownPredicate}} rule pushed the {{false}} join predicate down into the left-hand side of the join (it should have been the right-hand side). This caused the {{EliminateOuterJoin}} rule to rewrite the outer join into an inner join.

The optimized plan before disabling the {{PushDownPredicate}} rule (I had to 
disable the {{PruneFilters}} rule to prevent the plan from being erased):
{noformat}
Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
+- Join Inner
   :- Project [value#2 AS int_col_6#4]
   :  +- Filter false
   :     +- SerializeFromObject [input[0, int, true] AS value#2]
   :        +- ExternalRDD [obj#1]
   +- Project [value#10 AS int_col_1#12]
      +- SerializeFromObject [input[0, int, true] AS value#10]
         +- ExternalRDD [obj#9]
{noformat}

The optimized plan after disabling the {{PushDownPredicate}} rule:
{noformat}
== Optimized Logical Plan ==
Filter isnotnull(int_col#16)
+- Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
   +- Join LeftOuter, false
      :- Project [value#2 AS int_col_6#4]
      :  +- SerializeFromObject [input[0, int, true] AS value#2]
      :     +- ExternalRDD [obj#1]
      +- Project [value#10 AS int_col_1#12]
         +- SerializeFromObject [input[0, int, true] AS value#10]
            +- ExternalRDD [obj#9]
{noformat}

Btw set {{spark.sql.crossJoin.enabled}} to {{true}} if you want to run this.

> Analyzer incorrectly optimizes plan to empty LocalRelation
> --
>
> Key: SPARK-17120
> URL: https://issues.apache.org/jira/browse/SPARK-17120
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Consider the following query:
> {code}
> sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
> sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
> println(sql("""
>   SELECT
>   *
>   FROM (
>   SELECT
>   COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
>   FROM table_3 t1
>   LEFT JOIN table_4 t2 ON false
>   ) t where (t.int_col) is not null
> """).collect().toSeq)
> {code}
> In the innermost query, the LEFT JOIN's condition is {{false}} but 
> nevertheless the number of rows produced should equal the number of rows in 
> {{table_3}} (which is non-empty). Since no values are {{null}}, the outer 
> {{where}} should retain all rows, so the overall result of this query should 
> contain a single row with the value '97'.
> Instead, the current Spark master (as of 
> 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking 
> at {{explain}}, it appears that the logical plan is optimizing to 
> an empty {{LocalRelation}}, so Spark doesn't even run the query. My suspicion 
> is that there's a bug in constraint propagation or filter pushdown.
> This issue doesn't seem to affect Spark 2.0, so I think it's a regression in 
> master. 




---

[jira] [Comment Edited] (SPARK-17120) Analyzer incorrectly optimizes plan to empty LocalRelation

2016-08-23 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15433780#comment-15433780
 ] 

Herman van Hovell edited comment on SPARK-17120 at 8/23/16 11:02 PM:
-

TL;DR the {{EliminateOuterJoin}} rule converts the outer join into an Inner 
join:
{noformat}
16/08/24 00:55:46 TRACE SparkOptimizer: 
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin ===
Before:
Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
+- Filter isnotnull(coalesce(int_col_1#12, int_col_6#4))
   +- Join LeftOuter, false
      :- Project [value#2 AS int_col_6#4]
      :  +- SerializeFromObject [input[0, int, true] AS value#2]
      :     +- ExternalRDD [obj#1]
      +- Project [value#10 AS int_col_1#12]
         +- SerializeFromObject [input[0, int, true] AS value#10]
            +- ExternalRDD [obj#9]

After:
Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
+- Filter isnotnull(coalesce(int_col_1#12, int_col_6#4))
   +- Join Inner, false
      :- Project [value#2 AS int_col_6#4]
      :  +- SerializeFromObject [input[0, int, true] AS value#2]
      :     +- ExternalRDD [obj#1]
      +- Project [value#10 AS int_col_1#12]
         +- SerializeFromObject [input[0, int, true] AS value#10]
            +- ExternalRDD [obj#9]
{noformat}
It incorrectly assumes that a value that cannot be null could be, well... null, and then converts the join.

BTW: set {{spark.sql.crossJoin.enabled}} to {{true}} if you want to run this. 
Also use {{sc.setLogLevel("TRACE")}} to see what the optimizer is doing.

(Updated this: my first attempt at a diagnosis was way off.)


was (Author: hvanhovell):
TL;DR the {{EliminateOuterJoin}} rule converts the outer join into an Inner 
join:
{noformat}
16/08/24 00:55:46 TRACE SparkOptimizer: 
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin ===
Before:
Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
+- Filter isnotnull(coalesce(int_col_1#12, int_col_6#4))
   +- Join LeftOuter, false
      :- Project [value#2 AS int_col_6#4]
      :  +- SerializeFromObject [input[0, int, true] AS value#2]
      :     +- ExternalRDD [obj#1]
      +- Project [value#10 AS int_col_1#12]
         +- SerializeFromObject [input[0, int, true] AS value#10]
            +- ExternalRDD [obj#9]

After:
Project [coalesce(int_col_1#12, int_col_6#4) AS int_col#16]
+- Filter isnotnull(coalesce(int_col_1#12, int_col_6#4))
   +- Join Inner, false
      :- Project [value#2 AS int_col_6#4]
      :  +- SerializeFromObject [input[0, int, true] AS value#2]
      :     +- ExternalRDD [obj#1]
      +- Project [value#10 AS int_col_1#12]
         +- SerializeFromObject [input[0, int, true] AS value#10]
            +- ExternalRDD [obj#9]
{noformat}
It incorrectly assumes that a value that cannot be null could be, well... null, and then converts the join.

BTW: set {{spark.sql.crossJoin.enabled}} to {{true}} if you want to run this. 
Also use {{sc.setLogLevel("TRACE")}} to see what the optimizer is doing.

> Analyzer incorrectly optimizes plan to empty LocalRelation
> --
>
> Key: SPARK-17120
> URL: https://issues.apache.org/jira/browse/SPARK-17120
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Consider the following query:
> {code}
> sc.parallelize(Seq(97)).toDF("int_col_6").createOrReplaceTempView("table_3")
> sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
> println(sql("""
>   SELECT
>   *
>   FROM (
>   SELECT
>   COALESCE(t2.int_col_1, t1.int_col_6) AS int_col
>   FROM table_3 t1
>   LEFT JOIN table_4 t2 ON false
>   ) t where (t.int_col) is not null
> """).collect().toSeq)
> {code}
> In the innermost query, the LEFT JOIN's condition is {{false}} but 
> nevertheless the number of rows produced should equal the number of rows in 
> {{table_3}} (which is non-empty). Since no values are {{null}}, the outer 
> {{where}} should retain all rows, so the overall result of this query should 
> contain a single row with the value '97'.
> Instead, the current Spark master (as of 
> 12a89e55cbd630fa2986da984e066cd07d3bf1f7 at least) returns no rows. Looking 
> at {{explain}}, it appears that the logical plan is optimizing to 
> an empty {{LocalRelation}}, so Spark doesn't even run the query. My suspicion 
> is that there's a bug in constraint propagation or filter pushdown.
>