[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join

2016-06-15 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332721#comment-15332721 ]

Apache Spark commented on SPARK-12032:
--

User 'flyson' has created a pull request for this issue:
https://github.com/apache/spark/pull/10258

> Filter can't be pushed down to correct Join because of bad order of Join
> 
>
> Key: SPARK-12032
> URL: https://issues.apache.org/jira/browse/SPARK-12032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 2.0.0
>
>
> For this query:
> {code}
>   select d.d_year, count(*) cnt
>   FROM store_sales, date_dim d, customer c
>   WHERE ss_customer_sk = c.c_customer_sk AND c.c_first_shipto_date_sk = d.d_date_sk
>   group by d.d_year
> {code}
> Current optimized plan is
> {code}
> == Optimized Logical Plan ==
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some(((ss_customer_sk#283 = c_customer_sk#101) && (c_first_shipto_date_sk#106 = d_date_sk#141)))
>Project [d_date_sk#141,d_year#147,ss_customer_sk#283]
> Join Inner, None
>  Project [ss_customer_sk#283]
>   Relation[] ParquetRelation[store_sales]
>  Project [d_date_sk#141,d_year#147]
>   Relation[] ParquetRelation[date_dim]
>Project [c_customer_sk#101,c_first_shipto_date_sk#106]
> Relation[] ParquetRelation[customer]
> {code}
> It joins store_sales and date_dim together without any condition; the
> condition c.c_first_shipto_date_sk = d.d_date_sk cannot be pushed down to
> that join because of the bad order of the joins.
> The optimizer should re-order the joins, joining date_dim after customer,
> so that the condition can be pushed down correctly.
> The plan should be:
> {code}
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some((c_first_shipto_date_sk#106 = d_date_sk#141))
>Project [c_first_shipto_date_sk#106]
> Join Inner, Some((ss_customer_sk#283 = c_customer_sk#101))
>  Project [ss_customer_sk#283]
>   Relation[store_sales]
>  Project [c_first_shipto_date_sk#106,c_customer_sk#101]
>   Relation[customer]
>Project [d_year#147,d_date_sk#141]
> Relation[date_dim]
> {code}
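>
> For reference, this order can be forced by hand today through the DataFrame
> API; a minimal sketch, assuming DataFrames named storeSales, customer, and
> dateDim for the three tables:
> {code}
> // Join customer first, so that every join carries a condition,
> // then join date_dim on the column the first join exposes.
> val joined = storeSales
>   .join(customer, storeSales("ss_customer_sk") === customer("c_customer_sk"))
>   .join(dateDim, customer("c_first_shipto_date_sk") === dateDim("d_date_sk"))
> joined.groupBy("d_year").count()
> {code}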






[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join

2015-12-10 Thread Min Qiu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15051829#comment-15051829 ]

Min Qiu commented on SPARK-12032:
-

We ran into the same problem in our product development and came up with a
similar solution. Here is the pull request:
[#10258|https://github.com/apache/spark/pull/10258], in case anybody is
interested.




[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join

2015-12-01 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034545#comment-15034545 ]

Apache Spark commented on SPARK-12032:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/10073




[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join

2015-11-30 Thread Davies Liu (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032217#comment-15032217 ]

Davies Liu commented on SPARK-12032:


[~marmbrus] Do you have any idea how to fix this? I haven't figured out a
clean way.




[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join

2015-11-30 Thread Michael Armbrust (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032403#comment-15032403 ]

Michael Armbrust commented on SPARK-12032:
--

The standard algorithm for join reordering should handle this case, but adding 
it in is probably going to be a bit of work.
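
A simpler, condition-driven heuristic would already cover the case in the
description: start from the first relation, repeatedly pick a remaining
relation that shares a join condition with what has been joined so far, and
fall back to an unconditioned join only when nothing matches. A toy,
self-contained sketch of that idea (illustrative only, not Spark code;
relations are modeled as sets of column names):

{code}
// A join condition references one column on each side.
case class Cond(left: String, right: String)

// Greedily order relations so that every join has at least one condition.
def reorder(relations: Seq[Set[String]], conds: Seq[Cond]): Seq[Set[String]] = {
  val ordered   = scala.collection.mutable.ArrayBuffer(relations.head)
  val remaining = scala.collection.mutable.ArrayBuffer(relations.tail: _*)
  var joinedCols = relations.head
  while (remaining.nonEmpty) {
    // Prefer a relation with a condition against the columns joined so far.
    val idx = remaining.indexWhere { r =>
      conds.exists(c => (joinedCols(c.left) && r(c.right)) ||
                        (joinedCols(c.right) && r(c.left)))
    }
    val pick = remaining.remove(if (idx >= 0) idx else 0)
    ordered += pick
    joinedCols ++= pick
  }
  ordered.toSeq
}

// The three relations from the description, by the columns they contribute:
val storeSales = Set("ss_customer_sk")
val dateDim    = Set("d_date_sk", "d_year")
val customer   = Set("c_customer_sk", "c_first_shipto_date_sk")
val conds = Seq(Cond("ss_customer_sk", "c_customer_sk"),
                Cond("c_first_shipto_date_sk", "d_date_sk"))
// Keeps store_sales first but moves customer ahead of date_dim.
println(reorder(Seq(storeSales, dateDim, customer), conds))
{code}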




[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join

2015-11-30 Thread Reynold Xin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033187#comment-15033187 ]

Reynold Xin commented on SPARK-12032:
-

[~marmbrus] do you mean the Selinger algorithm?
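
For context, a Selinger (System R) optimizer chooses a join order by dynamic
programming over subsets of relations, keeping the cheapest plan found for
each subset. A self-contained toy sketch with made-up cardinalities and
selectivities (illustrative only, not Spark code):

{code}
// best(S) = cheapest left-deep plan joining exactly the relations in S,
// built by extending best(S - {r}) with r; cost = total intermediate rows.
case class Plan(order: List[String], cost: Long, rows: Long)

def selinger(cards: Map[String, Long], sel: (String, String) => Double): Plan = {
  val rels = cards.keys.toList
  val best = scala.collection.mutable.Map[Set[String], Plan]()
  for (r <- rels) best(Set(r)) = Plan(List(r), 0L, cards(r))
  for (size <- 2 to rels.size; subset <- rels.combinations(size).map(_.toSet)) {
    val candidates = for {
      r   <- subset.toList
      sub <- best.get(subset - r).toList
      // estimated rows out of joining sub with r, from pairwise selectivities
      rows = ((sub.rows * cards(r) *
               (subset - r).toList.map(sel(_, r)).product).toLong) max 1L
    } yield Plan(sub.order :+ r, sub.cost + rows, rows)
    best(subset) = candidates.minBy(_.cost)
  }
  best(rels.toSet)
}

// Made-up numbers for the three tables in the description:
val cards = Map("store_sales" -> 2000000L, "customer" -> 100000L,
                "date_dim" -> 73049L)
def s(a: String, b: String): Double =
  if (Set(a, b) == Set("store_sales", "customer")) 1.0 / 100000  // ss_customer_sk = c_customer_sk
  else if (Set(a, b) == Set("customer", "date_dim")) 1.0 / 73049 // c_first_shipto_date_sk = d_date_sk
  else 1.0                                                       // no condition
// With these numbers the DP joins customer and date_dim first, then
// store_sales; the unconditioned store_sales/date_dim pairing loses on cost.
println(selinger(cards, s).order)
{code}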

