[jira] [Updated] (HIVE-13254) GBY cardinality estimation is wrong partition columns is involved

2016-05-28 Thread Jesus Camacho Rodriguez (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesus Camacho Rodriguez updated HIVE-13254:
---
Target Version/s:   (was: 2.1.0)

> GBY cardinality estimation is wrong partition columns is involved
> -
>
> Key: HIVE-13254
> URL: https://issues.apache.org/jira/browse/HIVE-13254
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 1.3.0, 2.0.0, 2.1.0
>Reporter: Prasanth Jayachandran
>Assignee: Jesus Camacho Rodriguez
> Attachments: q3.svg, q3_ef_transpose_aggr.svg
>
>
> When running the following query on TPCDS-1000 scale, setting 
> hive.transpose.aggr.join=true is expected to generate optimal plan but it was 
> not generating. 
> {code:title=Query}
> SELECT `date_dim`.`d_day_name` AS `d_day_name`, 
>`item`.`i_category` AS `i_category` 
> FROM   `store_sales` `store_sales` 
>INNER JOIN `item` `item` 
>ON ( `store_sales`.`ss_item_sk` = `item`.`i_item_sk` ) 
>INNER JOIN `date_dim` `date_dim` 
>ON ( `store_sales`.`ss_sold_date_sk` = `date_dim`.`d_date_sk` 
> ) 
> GROUP  BY `d_day_name`, 
>   `i_category`;
> {code}
> The reason for that is stats annotation rule for GROUP BY is not considering 
> partition column into account. For the above query, the generated plan is 
> attached. As we can see from the plan, GBY is pushed to fact table 
> (store_sales) but that output of GBY shuffled to perform join instead of 
> MapJoin conversion. This is because of wrong estimation of cardinality/data 
> size of GBY on store_sales (Map 1). 
> What's happening internally is, GBY computes estimated cardinality which in 
> this case is NDV(ss_item_sk) * NDV(ss_sold_date_sk) = 338901 * 1823 ~= 617M. 
> This estimate is wrong as ss_sold_date_sk is partition column and estimator 
> assumes its non-partition column. In this case, not every tasks reads data 
> from all partitions. We need to take estimated task parallelism into account. 
> For example: If task parallelism is determined to be 100 the estimate from 
> GBY should be ~6M which should convert this vertex into map join vertex. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-13254) GBY cardinality estimation is wrong partition columns is involved

2016-03-21 Thread Prasanth Jayachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth Jayachandran updated HIVE-13254:
-
Attachment: q3.svg

> GBY cardinality estimation is wrong partition columns is involved
> -
>
> Key: HIVE-13254
> URL: https://issues.apache.org/jira/browse/HIVE-13254
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 1.3.0, 2.0.0, 2.1.0
>Reporter: Prasanth Jayachandran
>Assignee: Jesus Camacho Rodriguez
> Attachments: q3.svg, q3_ef_transpose_aggr.svg
>
>
> When running the following query on TPCDS-1000 scale, setting 
> hive.transpose.aggr.join=true is expected to generate optimal plan but it was 
> not generating. 
> {code:title=Query}
> SELECT `date_dim`.`d_day_name` AS `d_day_name`, 
>`item`.`i_category` AS `i_category` 
> FROM   `store_sales` `store_sales` 
>INNER JOIN `item` `item` 
>ON ( `store_sales`.`ss_item_sk` = `item`.`i_item_sk` ) 
>INNER JOIN `date_dim` `date_dim` 
>ON ( `store_sales`.`ss_sold_date_sk` = `date_dim`.`d_date_sk` 
> ) 
> GROUP  BY `d_day_name`, 
>   `i_category`;
> {code}
> The reason for that is stats annotation rule for GROUP BY is not considering 
> partition column into account. For the above query, the generated plan is 
> attached. As we can see from the plan, GBY is pushed to fact table 
> (store_sales) but that output of GBY shuffled to perform join instead of 
> MapJoin conversion. This is because of wrong estimation of cardinality/data 
> size of GBY on store_sales (Map 1). 
> What's happening internally is, GBY computes estimated cardinality which in 
> this case is NDV(ss_item_sk) * NDV(ss_sold_date_sk) = 338901 * 1823 ~= 617M. 
> This estimate is wrong as ss_sold_date_sk is partition column and estimator 
> assumes its non-partition column. In this case, not every tasks reads data 
> from all partitions. We need to take estimated task parallelism into account. 
> For example: If task parallelism is determined to be 100 the estimate from 
> GBY should be ~6M which should convert this vertex into map join vertex. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HIVE-13254) GBY cardinality estimation is wrong partition columns is involved

2016-03-09 Thread Prasanth Jayachandran (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prasanth Jayachandran updated HIVE-13254:
-
Attachment: q3_ef_transpose_aggr.svg

> GBY cardinality estimation is wrong partition columns is involved
> -
>
> Key: HIVE-13254
> URL: https://issues.apache.org/jira/browse/HIVE-13254
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 1.3.0, 2.0.0, 2.1.0
>Reporter: Prasanth Jayachandran
> Attachments: q3_ef_transpose_aggr.svg
>
>
> When running the following query on TPCDS-1000 scale, setting 
> hive.transpose.aggr.join=true is expected to generate optimal plan but it was 
> not generating. 
> {code:title=Query}
> SELECT `date_dim`.`d_day_name` AS `d_day_name`, 
>`item`.`i_category` AS `i_category` 
> FROM   `store_sales` `store_sales` 
>INNER JOIN `item` `item` 
>ON ( `store_sales`.`ss_item_sk` = `item`.`i_item_sk` ) 
>INNER JOIN `date_dim` `date_dim` 
>ON ( `store_sales`.`ss_sold_date_sk` = `date_dim`.`d_date_sk` 
> ) 
> GROUP  BY `d_day_name`, 
>   `i_category`;
> {code}
> The reason for that is stats annotation rule for GROUP BY is not considering 
> partition column into account. For the above query, the generated plan is 
> attached. As we can see from the plan, GBY is pushed to fact table 
> (store_sales) but that output of GBY shuffled to perform join instead of 
> MapJoin conversion. This is because of wrong estimation of cardinality/data 
> size of GBY on store_sales (Map 1). 
> What's happening internally is, GBY computes estimated cardinality which in 
> this case is NDV(ss_item_sk) * NDV(ss_sold_date_sk) = 338901 * 1823 ~= 617M. 
> This estimate is wrong as ss_sold_date_sk is partition column and estimator 
> assumes its non-partition column. In this case, not every tasks reads data 
> from all partitions. We need to take estimated task parallelism into account. 
> For example: If task parallelism is determined to be 100 the estimate from 
> GBY should be ~6M which should convert this vertex into map join vertex. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)