[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join
[ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332721#comment-15332721 ]

Apache Spark commented on SPARK-12032:
--------------------------------------

User 'flyson' has created a pull request for this issue:
https://github.com/apache/spark/pull/10258

> Filter can't be pushed down to correct Join because of bad order of Join
> ------------------------------------------------------------------------
>
>                 Key: SPARK-12032
>                 URL: https://issues.apache.org/jira/browse/SPARK-12032
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Davies Liu
>            Assignee: Davies Liu
>            Priority: Critical
>             Fix For: 2.0.0
>
> For this query:
> {code}
> SELECT d.d_year, count(*) cnt
>   FROM store_sales, date_dim d, customer c
>  WHERE ss_customer_sk = c.c_customer_sk
>    AND c.c_first_shipto_date_sk = d.d_date_sk
>  GROUP BY d.d_year
> {code}
> the current optimized plan is:
> {code}
> == Optimized Logical Plan ==
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some(((ss_customer_sk#283 = c_customer_sk#101) && (c_first_shipto_date_sk#106 = d_date_sk#141)))
>    Project [d_date_sk#141,d_year#147,ss_customer_sk#283]
>     Join Inner, None
>      Project [ss_customer_sk#283]
>       Relation[] ParquetRelation[store_sales]
>      Project [d_date_sk#141,d_year#147]
>       Relation[] ParquetRelation[date_dim]
>    Project [c_customer_sk#101,c_first_shipto_date_sk#106]
>     Relation[] ParquetRelation[customer]
> {code}
> This plan joins store_sales and date_dim together without any condition; the condition c.c_first_shipto_date_sk = d.d_date_sk cannot be pushed into that join because of the bad join order.
> The optimizer should re-order the joins (joining date_dim after customer) so that the condition can be pushed down correctly.
> The plan should be:
> {code}
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some((c_first_shipto_date_sk#106 = d_date_sk#141))
>    Project [c_first_shipto_date_sk#106]
>     Join Inner, Some((ss_customer_sk#283 = c_customer_sk#101))
>      Project [ss_customer_sk#283]
>       Relation[store_sales]
>      Project [c_first_shipto_date_sk#106,c_customer_sk#101]
>       Relation[customer]
>    Project [d_year#147,d_date_sk#141]
>     Relation[date_dim]
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
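The reordering described in the issue can be sketched outside of Spark. The following is a minimal Python sketch of the idea (not the Catalyst implementation): greedily pick the next relation that shares a join condition with the relations already joined, so that every inner join carries at least one predicate and the condition-less cross join from the bad plan never appears. The function and the string-based "plan" are illustrative inventions.

```python
# Sketch of condition-driven join reordering (illustrative, not Catalyst code):
# greedily pick the next relation connected by a join condition to the
# relations already joined, so every join carries at least one predicate.

def reorder_joins(relations, conditions):
    """relations: list of table names; conditions: list of (left, right) pairs."""
    ordered = [relations[0]]
    remaining = set(relations[1:])
    plan = relations[0]
    while remaining:
        # Prefer a relation connected to the current join tree by a condition.
        nxt = next((r for r in remaining
                    if any({a, b} & set(ordered) and r in (a, b)
                           for a, b in conditions)),
                   None)
        if nxt is None:          # no connected relation: fall back to a cross join
            nxt = remaining.pop()
        else:
            remaining.discard(nxt)
        ordered.append(nxt)
        plan = f"({plan} JOIN {nxt})"
    return plan

# For the query in this issue: the bad plan joins store_sales with date_dim
# first (no applicable condition); a condition-aware order joins customer first.
conds = [("store_sales", "customer"), ("customer", "date_dim")]
print(reorder_joins(["store_sales", "date_dim", "customer"], conds))
# ((store_sales JOIN customer) JOIN date_dim)
```

This matches the expected plan above: customer joins store_sales on ss_customer_sk = c_customer_sk, and only then is date_dim joined on the now-applicable c_first_shipto_date_sk = d_date_sk.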
[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join
[ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051829#comment-15051829 ]

Min Qiu commented on SPARK-12032:
---------------------------------

We ran into the same problem in our product development and came up with a similar solution. Here is the pull request, in case anybody is interested: [#10258|https://github.com/apache/spark/pull/10258]
[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join
[ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034545#comment-15034545 ]

Apache Spark commented on SPARK-12032:
--------------------------------------

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/10073
[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join
[ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033187#comment-15033187 ]

Reynold Xin commented on SPARK-12032:
-------------------------------------

[~marmbrus] Do you mean the Selinger algorithm?
[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join
[ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032403#comment-15032403 ]

Michael Armbrust commented on SPARK-12032:
------------------------------------------

The standard algorithm for join reordering should handle this case, but adding it in is probably going to be a bit of work.
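The standard join-reordering algorithm mentioned in the thread is usually Selinger-style dynamic programming: build the cheapest plan for every subset of relations bottom-up, extending only with joins that have an applicable condition. Below is a toy Python sketch of that idea; the cardinalities, the min-rows join estimate, and the nested-loop-style cost are made up for illustration and are not Catalyst's statistics or cost model.

```python
from itertools import combinations

# Toy Selinger-style DP over subsets of relations. Keeps the cheapest plan per
# subset; subsets reachable only via condition-less cross joins are never built,
# which rules out the bad store_sales x date_dim pairing from this issue.
# Cardinalities and the cost model are invented; assumes a connected join graph.

def selinger(card, conditions):
    """card: {relation: row_count}; conditions: set of frozenset({a, b}) pairs."""
    rels = list(card)
    # best[subset] = (cost, estimated_rows, plan_string)
    best = {frozenset([r]): (0, card[r], r) for r in rels}
    for size in range(2, len(rels) + 1):
        for subset in map(frozenset, combinations(rels, size)):
            for r in subset:
                rest = subset - {r}
                if rest not in best:
                    continue
                # Only extend with joins that have an applicable condition.
                if not any(cond <= subset and r in cond for cond in conditions):
                    continue
                lcost, lrows, lplan = best[rest]
                rows = min(lrows, card[r])        # crude equi-join estimate
                cost = lcost + lrows * card[r]    # crude nested-loop-style cost
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, rows, f"({lplan} JOIN {r})")
    return best[frozenset(rels)][2]

card = {"store_sales": 1_000_000, "date_dim": 73_000, "customer": 100_000}
conds = {frozenset(["store_sales", "customer"]),
         frozenset(["customer", "date_dim"])}
print(selinger(card, conds))
```

With these made-up numbers the DP joins the two small dimension tables first and the large fact table last; a real cost model with actual statistics may of course pick a different order, but every join in the chosen plan carries a condition.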
[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join
[ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032217#comment-15032217 ]

Davies Liu commented on SPARK-12032:
------------------------------------

[~marmbrus] Do you have any idea how to fix this? I haven't figured out a clean way.