[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join
[ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15051829#comment-15051829 ] Min Qiu commented on SPARK-12032: - We had run into the same problem in our product development and we had also come up with a similar solution. Here is the pull request: [#10258| https://github.com/apache/spark/pull/10258] just in case that somebody is interested in it. > Filter can't be pushed down to correct Join because of bad order of Join > > > Key: SPARK-12032 > URL: https://issues.apache.org/jira/browse/SPARK-12032 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Critical > Fix For: 2.0.0 > > > For this query: > {code} > select d.d_year, count(*) cnt >FROM store_sales, date_dim d, customer c >WHERE ss_customer_sk = c.c_customer_sk AND c.c_first_shipto_date_sk = > d.d_date_sk >group by d.d_year > {code} > Current optimized plan is > {code} > == Optimized Logical Plan == > Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) > AS cnt#425L] > Project [d_year#147] > Join Inner, Some(((ss_customer_sk#283 = c_customer_sk#101) && > (c_first_shipto_date_sk#106 = d_date_sk#141))) >Project [d_date_sk#141,d_year#147,ss_customer_sk#283] > Join Inner, None > Project [ss_customer_sk#283] > Relation[] ParquetRelation[store_sales] > Project [d_date_sk#141,d_year#147] > Relation[] ParquetRelation[date_dim] >Project [c_customer_sk#101,c_first_shipto_date_sk#106] > Relation[] ParquetRelation[customer] > {code} > It will join store_sales and date_dim together without any condition, the > condition c.c_first_shipto_date_sk = d.d_date_sk is not pushed to it because > the bad order of joins. > The optimizer should re-order the joins, join date_dim after customer, then > it can pushed down the condition correctly. > The plan should be > {code} > Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) > AS cnt#425L] > Project [d_year#147] > Join Inner, Some((c_first_shipto_date_sk#106 = d_date_sk#141)) >Project [c_first_shipto_date_sk#106] > Join Inner, Some((ss_customer_sk#283 = c_customer_sk#101)) > Project [ss_customer_sk#283] > Relation[store_sales] > Project [c_first_shipto_date_sk#106,c_customer_sk#101] > Relation[customer] >Project [d_year#147,d_date_sk#141] > Relation[date_dim] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12085) The join condition hidden in DNF can't be pushed down to join operator
[ https://issues.apache.org/jira/browse/SPARK-12085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036685#comment-15036685 ] Min Qiu commented on SPARK-12085: - looks like the BooleanSimplification rule in Spark 1.5 provides a general way to rewrite the predicate expression. It should covers my cases. Will test the query on Spark 1.5. > The join condition hidden in DNF can't be pushed down to join operator > --- > > Key: SPARK-12085 > URL: https://issues.apache.org/jira/browse/SPARK-12085 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Min Qiu > > TPC-H Q19: > {quote} > SELECT sum(l_extendedprice * (1 - l_discount)) AS revenue FROM part join > lineitem > WHERE ({color: red}p_partkey = l_partkey {color} >AND p_brand = 'Brand#12' >AND p_container IN ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') >AND l_quantity >= 1 AND l_quantity <= 1 + 10 >AND p_size BETWEEN 1 AND 5 >AND l_shipmode IN ('AIR', 'AIR REG') >AND l_shipinstruct = 'DELIVER IN PERSON') >OR ({color: red}p_partkey = l_partkey{color} >AND p_brand = 'Brand#23' >AND p_container IN ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') >AND l_quantity >= 10 AND l_quantity <= 10 + 10 >AND p_size BETWEEN 1 AND 10 >AND l_shipmode IN ('AIR', 'AIR REG') AND l_shipinstruct = 'DELIVER IN > PERSON') >OR ({color: red}p_partkey = l_partkey{color} >AND p_brand = 'Brand#34' >AND p_container IN ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') >AND l_quantity >= 20 AND l_quantity <= 20 + 10 >AND p_size BETWEEN 1 AND 15 >AND l_shipmode IN ('AIR', 'AIR REG') AND l_shipinstruct = 'DELIVER IN > PERSON') > {quote} > The equality condition {color:red} p_partkey = l_partkey{color} matches the > join relations but it cannot be recogized by optimizer because it's hidden in > a disjunctive normal form. As a result the entire where clause will be in a > filter operator on top of the join operator where the join condition would be > "None" in the optimized plan. Finally the query planner will apply a > prohibitive expensive cartesian product on the physical plan which causes OOM > exception or very bad performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12114) ColumnPruning rule fails in case of "Project <- Filter <- Join"
Min Qiu created SPARK-12114: --- Summary: ColumnPruning rule fails in case of "Project <- Filter <- Join" Key: SPARK-12114 URL: https://issues.apache.org/jira/browse/SPARK-12114 Project: Spark Issue Type: Bug Components: SQL Reporter: Min Qiu For the query {quote} SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity) FROM customer join orders join lineitem on c_custkey = o_custkey AND o_orderkey = l_orderkey left outer join (SELECT l_orderkey tmp_orderkey FROM lineitem GROUP BY l_orderkey HAVING sum(l_quantity) > 300) tmp on o_orderkey = tmp_orderkey WHERE tmp_orderkey IS NOT NULL GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice ORDER BY o_totalprice DESC, o_orderdate {quote} The optimizedPlan is {quote} Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC] Aggregate \[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48], \[c_name#38,c_custkey#37,o_orderkey#45, o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36] {color: green}Project \[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58] Filter IS NOT NULL tmp_orderkey#35 Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35)){color} Join Inner, Some((c_custkey#37 = o_custkey#46)) MetastoreRelation default, customer, None Join Inner, Some((o_orderkey#45 = l_orderkey#54)) MetastoreRelation default, orders, None MetastoreRelation default, lineitem, None Project \[tmp_orderkey#35] Filter havingCondition#86 Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS havingCondition#86,l_orderkey#70 AS tmp_orderkey#35] Project \[l_orderkey#70,l_quantity#74] MetastoreRelation default, lineitem, None {quote} Due to the pattern highlighted in green that the ColumnPruning rule fails to deal with, all columns of lineitem and orders tables are scanned. The unneeded columns are also involved in the data Shuffling. The performance is extremely bad if any one of the two tables is big. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12114) ColumnPruning rule fails in case of "Project <- Filter <- Join"
[ https://issues.apache.org/jira/browse/SPARK-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Qiu updated SPARK-12114: External issue URL: (was: https://github.com/apache/spark/pull/10087) > ColumnPruning rule fails in case of "Project <- Filter <- Join" > --- > > Key: SPARK-12114 > URL: https://issues.apache.org/jira/browse/SPARK-12114 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Min Qiu > > For the query > {quote} > SELECT c_name, c_custkey, o_orderkey, o_orderdate, >o_totalprice, sum(l_quantity) > FROM customer join orders join lineitem > on c_custkey = o_custkey AND o_orderkey = l_orderkey > left outer join (SELECT l_orderkey tmp_orderkey > FROM lineitem > GROUP BY l_orderkey > HAVING sum(l_quantity) > 300) tmp > on o_orderkey = tmp_orderkey > WHERE tmp_orderkey IS NOT NULL > GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice > ORDER BY o_totalprice DESC, o_orderdate > {quote} > The optimizedPlan is > {quote} > Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC] > > Aggregate > \[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48], > \[c_name#38,c_custkey#37,o_orderkey#45, > o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36] > {color: green}Project > \[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58] >Filter IS NOT NULL tmp_orderkey#35 > Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35)){color} > Join Inner, Some((c_custkey#37 = o_custkey#46)) > MetastoreRelation default, customer, None > Join Inner, Some((o_orderkey#45 = l_orderkey#54)) >MetastoreRelation default, orders, None >MetastoreRelation default, lineitem, None > Project \[tmp_orderkey#35] > Filter havingCondition#86 >Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS > havingCondition#86,l_orderkey#70 AS tmp_orderkey#35] > Project \[l_orderkey#70,l_quantity#74] > MetastoreRelation default, lineitem, None > {quote} > Due to the pattern highlighted in green that the ColumnPruning rule fails to > deal with, all columns of lineitem and orders tables are scanned. The > unneeded columns are also involved in the data Shuffling. The performance is > extremely bad if any one of the two tables is big. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12114) ColumnPruning rule fails in case of "Project <- Filter <- Join"
[ https://issues.apache.org/jira/browse/SPARK-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037023#comment-15037023 ] Min Qiu commented on SPARK-12114: - The [pull request|https://github.com/apache/spark/pull/10087] is related to this issue, but the pull request contains changes to the other problem. I will work out a new pull request for this issue only. > ColumnPruning rule fails in case of "Project <- Filter <- Join" > --- > > Key: SPARK-12114 > URL: https://issues.apache.org/jira/browse/SPARK-12114 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Min Qiu > > For the query > {quote} > SELECT c_name, c_custkey, o_orderkey, o_orderdate, >o_totalprice, sum(l_quantity) > FROM customer join orders join lineitem > on c_custkey = o_custkey AND o_orderkey = l_orderkey > left outer join (SELECT l_orderkey tmp_orderkey > FROM lineitem > GROUP BY l_orderkey > HAVING sum(l_quantity) > 300) tmp > on o_orderkey = tmp_orderkey > WHERE tmp_orderkey IS NOT NULL > GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice > ORDER BY o_totalprice DESC, o_orderdate > {quote} > The optimizedPlan is > {quote} > Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC] > > Aggregate > \[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48], > \[c_name#38,c_custkey#37,o_orderkey#45, > o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36] > {color: green}Project > \[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58] >Filter IS NOT NULL tmp_orderkey#35 > Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35)){color} > Join Inner, Some((c_custkey#37 = o_custkey#46)) > MetastoreRelation default, customer, None > Join Inner, Some((o_orderkey#45 = l_orderkey#54)) >MetastoreRelation default, orders, None >MetastoreRelation default, lineitem, None > Project \[tmp_orderkey#35] > Filter havingCondition#86 >Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS > havingCondition#86,l_orderkey#70 AS tmp_orderkey#35] > Project \[l_orderkey#70,l_quantity#74] > MetastoreRelation default, lineitem, None > {quote} > Due to the pattern highlighted in green that the ColumnPruning rule fails to > deal with, all columns of lineitem and orders tables are scanned. The > unneeded columns are also involved in the data Shuffling. The performance is > extremely bad if any one of the two tables is big. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12114) ColumnPruning rule fails in case of "Project <- Filter <- Join"
[ https://issues.apache.org/jira/browse/SPARK-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Min Qiu updated SPARK-12114: External issue URL: https://github.com/apache/spark/pull/10087 > ColumnPruning rule fails in case of "Project <- Filter <- Join" > --- > > Key: SPARK-12114 > URL: https://issues.apache.org/jira/browse/SPARK-12114 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Min Qiu > > For the query > {quote} > SELECT c_name, c_custkey, o_orderkey, o_orderdate, >o_totalprice, sum(l_quantity) > FROM customer join orders join lineitem > on c_custkey = o_custkey AND o_orderkey = l_orderkey > left outer join (SELECT l_orderkey tmp_orderkey > FROM lineitem > GROUP BY l_orderkey > HAVING sum(l_quantity) > 300) tmp > on o_orderkey = tmp_orderkey > WHERE tmp_orderkey IS NOT NULL > GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice > ORDER BY o_totalprice DESC, o_orderdate > {quote} > The optimizedPlan is > {quote} > Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC] > > Aggregate > \[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48], > \[c_name#38,c_custkey#37,o_orderkey#45, > o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36] > {color: green}Project > \[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58] >Filter IS NOT NULL tmp_orderkey#35 > Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35)){color} > Join Inner, Some((c_custkey#37 = o_custkey#46)) > MetastoreRelation default, customer, None > Join Inner, Some((o_orderkey#45 = l_orderkey#54)) >MetastoreRelation default, orders, None >MetastoreRelation default, lineitem, None > Project \[tmp_orderkey#35] > Filter havingCondition#86 >Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS > havingCondition#86,l_orderkey#70 AS tmp_orderkey#35] > Project \[l_orderkey#70,l_quantity#74] > MetastoreRelation default, lineitem, None > {quote} > Due to the pattern highlighted in green that the ColumnPruning rule fails to > deal with, all columns of lineitem and orders tables are scanned. The > unneeded columns are also involved in the data Shuffling. The performance is > extremely bad if any one of the two tables is big. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12085) The join condition hidden in DNF can't be pushed down to join operator
Min Qiu created SPARK-12085: --- Summary: The join condition hidden in DNF can't be pushed down to join operator Key: SPARK-12085 URL: https://issues.apache.org/jira/browse/SPARK-12085 Project: Spark Issue Type: Improvement Components: SQL Reporter: Min Qiu TPC-H Q19: {quote} SELECT sum(l_extendedprice * (1 - l_discount)) AS revenue FROM part join lineitem WHERE ({color: red}p_partkey = l_partkey {color} AND p_brand = 'Brand#12' AND p_container IN ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') AND l_quantity >= 1 AND l_quantity <= 1 + 10 AND p_size BETWEEN 1 AND 5 AND l_shipmode IN ('AIR', 'AIR REG') AND l_shipinstruct = 'DELIVER IN PERSON') OR ({color: red}p_partkey = l_partkey{color} AND p_brand = 'Brand#23' AND p_container IN ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') AND l_quantity >= 10 AND l_quantity <= 10 + 10 AND p_size BETWEEN 1 AND 10 AND l_shipmode IN ('AIR', 'AIR REG') AND l_shipinstruct = 'DELIVER IN PERSON') OR ({color: red}p_partkey = l_partkey{color} AND p_brand = 'Brand#34' AND p_container IN ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') AND l_quantity >= 20 AND l_quantity <= 20 + 10 AND p_size BETWEEN 1 AND 15 AND l_shipmode IN ('AIR', 'AIR REG') AND l_shipinstruct = 'DELIVER IN PERSON') {quote} The equality condition {color:red} p_partkey = l_partkey{color} matches the join relations but it cannot be recogized by optimizer because it's hidden in a disjunctive normal form. As a result the entire where clause will be in a filter operator on top of the join operator where the join condition would be "None" in the optimized plan. Finally the query planner will apply a prohibitive expensive cartesian product on the physical plan which causes OOM exception or very bad performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12085) The join condition hidden in DNF can't be pushed down to join operator
[ https://issues.apache.org/jira/browse/SPARK-12085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035147#comment-15035147 ] Min Qiu commented on SPARK-12085: - just submitted a [pull request #10087|https://github.com/apache/spark/pull/10087] > The join condition hidden in DNF can't be pushed down to join operator > --- > > Key: SPARK-12085 > URL: https://issues.apache.org/jira/browse/SPARK-12085 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Min Qiu > > TPC-H Q19: > {quote} > SELECT sum(l_extendedprice * (1 - l_discount)) AS revenue FROM part join > lineitem > WHERE ({color: red}p_partkey = l_partkey {color} >AND p_brand = 'Brand#12' >AND p_container IN ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') >AND l_quantity >= 1 AND l_quantity <= 1 + 10 >AND p_size BETWEEN 1 AND 5 >AND l_shipmode IN ('AIR', 'AIR REG') >AND l_shipinstruct = 'DELIVER IN PERSON') >OR ({color: red}p_partkey = l_partkey{color} >AND p_brand = 'Brand#23' >AND p_container IN ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') >AND l_quantity >= 10 AND l_quantity <= 10 + 10 >AND p_size BETWEEN 1 AND 10 >AND l_shipmode IN ('AIR', 'AIR REG') AND l_shipinstruct = 'DELIVER IN > PERSON') >OR ({color: red}p_partkey = l_partkey{color} >AND p_brand = 'Brand#34' >AND p_container IN ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') >AND l_quantity >= 20 AND l_quantity <= 20 + 10 >AND p_size BETWEEN 1 AND 15 >AND l_shipmode IN ('AIR', 'AIR REG') AND l_shipinstruct = 'DELIVER IN > PERSON') > {quote} > The equality condition {color:red} p_partkey = l_partkey{color} matches the > join relations but it cannot be recogized by optimizer because it's hidden in > a disjunctive normal form. As a result the entire where clause will be in a > filter operator on top of the join operator where the join condition would be > "None" in the optimized plan. Finally the query planner will apply a > prohibitive expensive cartesian product on the physical plan which causes OOM > exception or very bad performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org