[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join

2015-12-10 Thread Min Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051829#comment-15051829
 ] 

Min Qiu commented on SPARK-12032:
-

We had run into the same problem in our product development and we had also 
come up with a similar solution.
Here is the pull request: [#10258| https://github.com/apache/spark/pull/10258] 
just in case that somebody is interested in it.

> Filter can't be pushed down to correct Join because of bad order of Join
> 
>
> Key: SPARK-12032
> URL: https://issues.apache.org/jira/browse/SPARK-12032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 2.0.0
>
>
> For this query:
> {code}
>   select d.d_year, count(*) cnt
>FROM store_sales, date_dim d, customer c
>WHERE ss_customer_sk = c.c_customer_sk AND c.c_first_shipto_date_sk = 
> d.d_date_sk
>group by d.d_year
> {code}
> Current optimized plan is
> {code}
> == Optimized Logical Plan ==
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) 
> AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some(((ss_customer_sk#283 = c_customer_sk#101) && 
> (c_first_shipto_date_sk#106 = d_date_sk#141)))
>Project [d_date_sk#141,d_year#147,ss_customer_sk#283]
> Join Inner, None
>  Project [ss_customer_sk#283]
>   Relation[] ParquetRelation[store_sales]
>  Project [d_date_sk#141,d_year#147]
>   Relation[] ParquetRelation[date_dim]
>Project [c_customer_sk#101,c_first_shipto_date_sk#106]
> Relation[] ParquetRelation[customer]
> {code}
> It will join store_sales and date_dim together without any condition, the 
> condition c.c_first_shipto_date_sk = d.d_date_sk is not pushed to it because 
> the bad order of joins.
> The optimizer should re-order the joins, join date_dim after customer, then 
> it can pushed down the condition correctly.
> The plan should be 
> {code}
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) 
> AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some((c_first_shipto_date_sk#106 = d_date_sk#141))
>Project [c_first_shipto_date_sk#106]
> Join Inner, Some((ss_customer_sk#283 = c_customer_sk#101))
>  Project [ss_customer_sk#283]
>   Relation[store_sales]
>  Project [c_first_shipto_date_sk#106,c_customer_sk#101]
>   Relation[customer]
>Project [d_year#147,d_date_sk#141]
> Relation[date_dim]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12085) The join condition hidden in DNF can't be pushed down to join operator

2015-12-02 Thread Min Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036685#comment-15036685
 ] 

Min Qiu commented on SPARK-12085:
-

looks like the BooleanSimplification rule in Spark 1.5 provides a general way 
to rewrite the predicate expression. It should covers my cases. Will test the 
query on Spark 1.5.  

> The join condition hidden in DNF can't be pushed down to join operator 
> ---
>
> Key: SPARK-12085
> URL: https://issues.apache.org/jira/browse/SPARK-12085
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Min Qiu
>
> TPC-H Q19:
> {quote}
> SELECT sum(l_extendedprice * (1 - l_discount)) AS revenue FROM part join 
> lineitem 
> WHERE ({color: red}p_partkey = l_partkey {color}
>AND p_brand = 'Brand#12' 
>AND p_container IN ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') 
>AND l_quantity >= 1 AND l_quantity <= 1 + 10 
>AND p_size BETWEEN 1 AND 5 
>AND l_shipmode IN ('AIR', 'AIR REG') 
>AND l_shipinstruct = 'DELIVER IN PERSON') 
>OR ({color: red}p_partkey = l_partkey{color}
>AND p_brand = 'Brand#23' 
>AND p_container IN ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') 
>AND l_quantity >= 10 AND l_quantity <= 10 + 10 
>AND p_size BETWEEN 1 AND 10 
>AND l_shipmode IN ('AIR', 'AIR REG') AND l_shipinstruct = 'DELIVER IN 
> PERSON') 
>OR ({color: red}p_partkey = l_partkey{color} 
>AND p_brand = 'Brand#34' 
>AND p_container IN ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') 
>AND l_quantity >= 20 AND l_quantity <= 20 + 10 
>AND p_size BETWEEN 1 AND 15 
>AND l_shipmode IN ('AIR', 'AIR REG') AND l_shipinstruct = 'DELIVER IN 
> PERSON')
> {quote}
> The equality condition {color:red} p_partkey = l_partkey{color} matches the 
> join relations but it cannot be recogized by optimizer because it's hidden in 
> a disjunctive normal form.  As a result the entire where clause will be in a 
> filter operator on top of the join operator where the join condition would be 
> "None" in the optimized plan. Finally the query planner will apply a 
> prohibitive expensive cartesian product on the physical plan which causes OOM 
> exception or very bad performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12114) ColumnPruning rule fails in case of "Project <- Filter <- Join"

2015-12-02 Thread Min Qiu (JIRA)
Min Qiu created SPARK-12114:
---

 Summary: ColumnPruning rule fails in case of "Project <- Filter <- 
Join"
 Key: SPARK-12114
 URL: https://issues.apache.org/jira/browse/SPARK-12114
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Min Qiu


For the query
{quote}
SELECT c_name, c_custkey, o_orderkey, o_orderdate, 
   o_totalprice, sum(l_quantity) 
FROM customer join orders join lineitem 
  on c_custkey = o_custkey AND o_orderkey = l_orderkey 
 left outer join (SELECT l_orderkey tmp_orderkey 
  FROM lineitem 
  GROUP BY l_orderkey 
  HAVING sum(l_quantity) > 300) tmp 
  on o_orderkey = tmp_orderkey 
WHERE tmp_orderkey IS NOT NULL 
GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice 
ORDER BY o_totalprice DESC, o_orderdate
{quote}

The optimizedPlan is 
{quote}
Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC]
 
 Aggregate 
\[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48], 
\[c_name#38,c_custkey#37,o_orderkey#45,
o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36]
  {color: green}Project 
\[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58]
   Filter IS NOT NULL tmp_orderkey#35
Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35)){color}
 Join Inner, Some((c_custkey#37 = o_custkey#46))
  MetastoreRelation default, customer, None
  Join Inner, Some((o_orderkey#45 = l_orderkey#54))
   MetastoreRelation default, orders, None
   MetastoreRelation default, lineitem, None
 Project \[tmp_orderkey#35]
  Filter havingCondition#86
   Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS 
havingCondition#86,l_orderkey#70 AS tmp_orderkey#35]
Project \[l_orderkey#70,l_quantity#74]
 MetastoreRelation default, lineitem, None
{quote}

Due to the pattern highlighted in green that the ColumnPruning rule fails to 
deal with,  all columns of lineitem and orders tables are scanned. The unneeded 
columns are also involved in the data Shuffling. The performance is extremely 
bad if any one of the two tables is big.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12114) ColumnPruning rule fails in case of "Project <- Filter <- Join"

2015-12-02 Thread Min Qiu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Qiu updated SPARK-12114:

External issue URL:   (was: https://github.com/apache/spark/pull/10087)

> ColumnPruning rule fails in case of "Project <- Filter <- Join"
> ---
>
> Key: SPARK-12114
> URL: https://issues.apache.org/jira/browse/SPARK-12114
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Min Qiu
>
> For the query
> {quote}
> SELECT c_name, c_custkey, o_orderkey, o_orderdate, 
>o_totalprice, sum(l_quantity) 
> FROM customer join orders join lineitem 
>   on c_custkey = o_custkey AND o_orderkey = l_orderkey 
>  left outer join (SELECT l_orderkey tmp_orderkey 
>   FROM lineitem 
>   GROUP BY l_orderkey 
>   HAVING sum(l_quantity) > 300) tmp 
>   on o_orderkey = tmp_orderkey 
> WHERE tmp_orderkey IS NOT NULL 
> GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice 
> ORDER BY o_totalprice DESC, o_orderdate
> {quote}
> The optimizedPlan is 
> {quote}
> Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC]
>  
>  Aggregate 
> \[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48], 
> \[c_name#38,c_custkey#37,o_orderkey#45,
> o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36]
>   {color: green}Project 
> \[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58]
>Filter IS NOT NULL tmp_orderkey#35
> Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35)){color}
>  Join Inner, Some((c_custkey#37 = o_custkey#46))
>   MetastoreRelation default, customer, None
>   Join Inner, Some((o_orderkey#45 = l_orderkey#54))
>MetastoreRelation default, orders, None
>MetastoreRelation default, lineitem, None
>  Project \[tmp_orderkey#35]
>   Filter havingCondition#86
>Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS 
> havingCondition#86,l_orderkey#70 AS tmp_orderkey#35]
> Project \[l_orderkey#70,l_quantity#74]
>  MetastoreRelation default, lineitem, None
> {quote}
> Due to the pattern highlighted in green that the ColumnPruning rule fails to 
> deal with,  all columns of lineitem and orders tables are scanned. The 
> unneeded columns are also involved in the data Shuffling. The performance is 
> extremely bad if any one of the two tables is big.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12114) ColumnPruning rule fails in case of "Project <- Filter <- Join"

2015-12-02 Thread Min Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037023#comment-15037023
 ] 

Min Qiu commented on SPARK-12114:
-

The [pull request|https://github.com/apache/spark/pull/10087] is related to 
this issue, but the pull request contains changes to the other problem. I will 
work out a new pull request for this issue only.

> ColumnPruning rule fails in case of "Project <- Filter <- Join"
> ---
>
> Key: SPARK-12114
> URL: https://issues.apache.org/jira/browse/SPARK-12114
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Min Qiu
>
> For the query
> {quote}
> SELECT c_name, c_custkey, o_orderkey, o_orderdate, 
>o_totalprice, sum(l_quantity) 
> FROM customer join orders join lineitem 
>   on c_custkey = o_custkey AND o_orderkey = l_orderkey 
>  left outer join (SELECT l_orderkey tmp_orderkey 
>   FROM lineitem 
>   GROUP BY l_orderkey 
>   HAVING sum(l_quantity) > 300) tmp 
>   on o_orderkey = tmp_orderkey 
> WHERE tmp_orderkey IS NOT NULL 
> GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice 
> ORDER BY o_totalprice DESC, o_orderdate
> {quote}
> The optimizedPlan is 
> {quote}
> Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC]
>  
>  Aggregate 
> \[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48], 
> \[c_name#38,c_custkey#37,o_orderkey#45,
> o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36]
>   {color: green}Project 
> \[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58]
>Filter IS NOT NULL tmp_orderkey#35
> Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35)){color}
>  Join Inner, Some((c_custkey#37 = o_custkey#46))
>   MetastoreRelation default, customer, None
>   Join Inner, Some((o_orderkey#45 = l_orderkey#54))
>MetastoreRelation default, orders, None
>MetastoreRelation default, lineitem, None
>  Project \[tmp_orderkey#35]
>   Filter havingCondition#86
>Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS 
> havingCondition#86,l_orderkey#70 AS tmp_orderkey#35]
> Project \[l_orderkey#70,l_quantity#74]
>  MetastoreRelation default, lineitem, None
> {quote}
> Due to the pattern highlighted in green that the ColumnPruning rule fails to 
> deal with,  all columns of lineitem and orders tables are scanned. The 
> unneeded columns are also involved in the data Shuffling. The performance is 
> extremely bad if any one of the two tables is big.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12114) ColumnPruning rule fails in case of "Project <- Filter <- Join"

2015-12-02 Thread Min Qiu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Min Qiu updated SPARK-12114:

External issue URL: https://github.com/apache/spark/pull/10087

> ColumnPruning rule fails in case of "Project <- Filter <- Join"
> ---
>
> Key: SPARK-12114
> URL: https://issues.apache.org/jira/browse/SPARK-12114
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Min Qiu
>
> For the query
> {quote}
> SELECT c_name, c_custkey, o_orderkey, o_orderdate, 
>o_totalprice, sum(l_quantity) 
> FROM customer join orders join lineitem 
>   on c_custkey = o_custkey AND o_orderkey = l_orderkey 
>  left outer join (SELECT l_orderkey tmp_orderkey 
>   FROM lineitem 
>   GROUP BY l_orderkey 
>   HAVING sum(l_quantity) > 300) tmp 
>   on o_orderkey = tmp_orderkey 
> WHERE tmp_orderkey IS NOT NULL 
> GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice 
> ORDER BY o_totalprice DESC, o_orderdate
> {quote}
> The optimizedPlan is 
> {quote}
> Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC]
>  
>  Aggregate 
> \[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48], 
> \[c_name#38,c_custkey#37,o_orderkey#45,
> o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36]
>   {color: green}Project 
> \[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58]
>Filter IS NOT NULL tmp_orderkey#35
> Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35)){color}
>  Join Inner, Some((c_custkey#37 = o_custkey#46))
>   MetastoreRelation default, customer, None
>   Join Inner, Some((o_orderkey#45 = l_orderkey#54))
>MetastoreRelation default, orders, None
>MetastoreRelation default, lineitem, None
>  Project \[tmp_orderkey#35]
>   Filter havingCondition#86
>Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS 
> havingCondition#86,l_orderkey#70 AS tmp_orderkey#35]
> Project \[l_orderkey#70,l_quantity#74]
>  MetastoreRelation default, lineitem, None
> {quote}
> Due to the pattern highlighted in green that the ColumnPruning rule fails to 
> deal with,  all columns of lineitem and orders tables are scanned. The 
> unneeded columns are also involved in the data Shuffling. The performance is 
> extremely bad if any one of the two tables is big.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12085) The join condition hidden in DNF can't be pushed down to join operator

2015-12-01 Thread Min Qiu (JIRA)
Min Qiu created SPARK-12085:
---

 Summary: The join condition hidden in DNF can't be pushed down to 
join operator 
 Key: SPARK-12085
 URL: https://issues.apache.org/jira/browse/SPARK-12085
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Min Qiu


TPC-H Q19:
{quote}
SELECT sum(l_extendedprice * (1 - l_discount)) AS revenue FROM part join 
lineitem 
WHERE ({color: red}p_partkey = l_partkey {color}
   AND p_brand = 'Brand#12' 
   AND p_container IN ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') 
   AND l_quantity >= 1 AND l_quantity <= 1 + 10 
   AND p_size BETWEEN 1 AND 5 
   AND l_shipmode IN ('AIR', 'AIR REG') 
   AND l_shipinstruct = 'DELIVER IN PERSON') 
   OR ({color: red}p_partkey = l_partkey{color}
   AND p_brand = 'Brand#23' 
   AND p_container IN ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') 
   AND l_quantity >= 10 AND l_quantity <= 10 + 10 
   AND p_size BETWEEN 1 AND 10 
   AND l_shipmode IN ('AIR', 'AIR REG') AND l_shipinstruct = 'DELIVER IN 
PERSON') 
   OR ({color: red}p_partkey = l_partkey{color} 
   AND p_brand = 'Brand#34' 
   AND p_container IN ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') 
   AND l_quantity >= 20 AND l_quantity <= 20 + 10 
   AND p_size BETWEEN 1 AND 15 
   AND l_shipmode IN ('AIR', 'AIR REG') AND l_shipinstruct = 'DELIVER IN 
PERSON')
{quote}

The equality condition {color:red} p_partkey = l_partkey{color} matches the 
join relations but it cannot be recogized by optimizer because it's hidden in a 
disjunctive normal form.  As a result the entire where clause will be in a 
filter operator on top of the join operator where the join condition would be 
"None" in the optimized plan. Finally the query planner will apply a 
prohibitive expensive cartesian product on the physical plan which causes OOM 
exception or very bad performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12085) The join condition hidden in DNF can't be pushed down to join operator

2015-12-01 Thread Min Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035147#comment-15035147
 ] 

Min Qiu commented on SPARK-12085:
-

just submitted a [pull request 
#10087|https://github.com/apache/spark/pull/10087]

> The join condition hidden in DNF can't be pushed down to join operator 
> ---
>
> Key: SPARK-12085
> URL: https://issues.apache.org/jira/browse/SPARK-12085
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Min Qiu
>
> TPC-H Q19:
> {quote}
> SELECT sum(l_extendedprice * (1 - l_discount)) AS revenue FROM part join 
> lineitem 
> WHERE ({color: red}p_partkey = l_partkey {color}
>AND p_brand = 'Brand#12' 
>AND p_container IN ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') 
>AND l_quantity >= 1 AND l_quantity <= 1 + 10 
>AND p_size BETWEEN 1 AND 5 
>AND l_shipmode IN ('AIR', 'AIR REG') 
>AND l_shipinstruct = 'DELIVER IN PERSON') 
>OR ({color: red}p_partkey = l_partkey{color}
>AND p_brand = 'Brand#23' 
>AND p_container IN ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') 
>AND l_quantity >= 10 AND l_quantity <= 10 + 10 
>AND p_size BETWEEN 1 AND 10 
>AND l_shipmode IN ('AIR', 'AIR REG') AND l_shipinstruct = 'DELIVER IN 
> PERSON') 
>OR ({color: red}p_partkey = l_partkey{color} 
>AND p_brand = 'Brand#34' 
>AND p_container IN ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') 
>AND l_quantity >= 20 AND l_quantity <= 20 + 10 
>AND p_size BETWEEN 1 AND 15 
>AND l_shipmode IN ('AIR', 'AIR REG') AND l_shipinstruct = 'DELIVER IN 
> PERSON')
> {quote}
> The equality condition {color:red} p_partkey = l_partkey{color} matches the 
> join relations but it cannot be recogized by optimizer because it's hidden in 
> a disjunctive normal form.  As a result the entire where clause will be in a 
> filter operator on top of the join operator where the join condition would be 
> "None" in the optimized plan. Finally the query planner will apply a 
> prohibitive expensive cartesian product on the physical plan which causes OOM 
> exception or very bad performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org