[jira] [Updated] (HIVE-13254) GBY cardinality estimation is wrong partition columns is involved
[ https://issues.apache.org/jira/browse/HIVE-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jesus Camacho Rodriguez updated HIVE-13254: --- Target Version/s: (was: 2.1.0) > GBY cardinality estimation is wrong partition columns is involved > - > > Key: HIVE-13254 > URL: https://issues.apache.org/jira/browse/HIVE-13254 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 1.3.0, 2.0.0, 2.1.0 >Reporter: Prasanth Jayachandran >Assignee: Jesus Camacho Rodriguez > Attachments: q3.svg, q3_ef_transpose_aggr.svg > > > When running the following query on TPCDS-1000 scale, setting > hive.transpose.aggr.join=true is expected to generate optimal plan but it was > not generating. > {code:title=Query} > SELECT `date_dim`.`d_day_name` AS `d_day_name`, >`item`.`i_category` AS `i_category` > FROM `store_sales` `store_sales` >INNER JOIN `item` `item` >ON ( `store_sales`.`ss_item_sk` = `item`.`i_item_sk` ) >INNER JOIN `date_dim` `date_dim` >ON ( `store_sales`.`ss_sold_date_sk` = `date_dim`.`d_date_sk` > ) > GROUP BY `d_day_name`, > `i_category`; > {code} > The reason for that is stats annotation rule for GROUP BY is not considering > partition column into account. For the above query, the generated plan is > attached. As we can see from the plan, GBY is pushed to fact table > (store_sales) but that output of GBY shuffled to perform join instead of > MapJoin conversion. This is because of wrong estimation of cardinality/data > size of GBY on store_sales (Map 1). > What's happening internally is, GBY computes estimated cardinality which in > this case is NDV(ss_item_sk) * NDV(ss_sold_date_sk) = 338901 * 1823 ~= 617M. > This estimate is wrong as ss_sold_date_sk is partition column and estimator > assumes its non-partition column. In this case, not every tasks reads data > from all partitions. We need to take estimated task parallelism into account. > For example: If task parallelism is determined to be 100 the estimate from > GBY should be ~6M which should convert this vertex into map join vertex. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-13254) GBY cardinality estimation is wrong partition columns is involved
[ https://issues.apache.org/jira/browse/HIVE-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prasanth Jayachandran updated HIVE-13254: - Attachment: q3.svg > GBY cardinality estimation is wrong partition columns is involved > - > > Key: HIVE-13254 > URL: https://issues.apache.org/jira/browse/HIVE-13254 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 1.3.0, 2.0.0, 2.1.0 >Reporter: Prasanth Jayachandran >Assignee: Jesus Camacho Rodriguez > Attachments: q3.svg, q3_ef_transpose_aggr.svg > > > When running the following query on TPCDS-1000 scale, setting > hive.transpose.aggr.join=true is expected to generate optimal plan but it was > not generating. > {code:title=Query} > SELECT `date_dim`.`d_day_name` AS `d_day_name`, >`item`.`i_category` AS `i_category` > FROM `store_sales` `store_sales` >INNER JOIN `item` `item` >ON ( `store_sales`.`ss_item_sk` = `item`.`i_item_sk` ) >INNER JOIN `date_dim` `date_dim` >ON ( `store_sales`.`ss_sold_date_sk` = `date_dim`.`d_date_sk` > ) > GROUP BY `d_day_name`, > `i_category`; > {code} > The reason for that is stats annotation rule for GROUP BY is not considering > partition column into account. For the above query, the generated plan is > attached. As we can see from the plan, GBY is pushed to fact table > (store_sales) but that output of GBY shuffled to perform join instead of > MapJoin conversion. This is because of wrong estimation of cardinality/data > size of GBY on store_sales (Map 1). > What's happening internally is, GBY computes estimated cardinality which in > this case is NDV(ss_item_sk) * NDV(ss_sold_date_sk) = 338901 * 1823 ~= 617M. > This estimate is wrong as ss_sold_date_sk is partition column and estimator > assumes its non-partition column. In this case, not every tasks reads data > from all partitions. We need to take estimated task parallelism into account. > For example: If task parallelism is determined to be 100 the estimate from > GBY should be ~6M which should convert this vertex into map join vertex. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-13254) GBY cardinality estimation is wrong partition columns is involved
[ https://issues.apache.org/jira/browse/HIVE-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prasanth Jayachandran updated HIVE-13254: - Attachment: q3_ef_transpose_aggr.svg > GBY cardinality estimation is wrong partition columns is involved > - > > Key: HIVE-13254 > URL: https://issues.apache.org/jira/browse/HIVE-13254 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 1.3.0, 2.0.0, 2.1.0 >Reporter: Prasanth Jayachandran > Attachments: q3_ef_transpose_aggr.svg > > > When running the following query on TPCDS-1000 scale, setting > hive.transpose.aggr.join=true is expected to generate optimal plan but it was > not generating. > {code:title=Query} > SELECT `date_dim`.`d_day_name` AS `d_day_name`, >`item`.`i_category` AS `i_category` > FROM `store_sales` `store_sales` >INNER JOIN `item` `item` >ON ( `store_sales`.`ss_item_sk` = `item`.`i_item_sk` ) >INNER JOIN `date_dim` `date_dim` >ON ( `store_sales`.`ss_sold_date_sk` = `date_dim`.`d_date_sk` > ) > GROUP BY `d_day_name`, > `i_category`; > {code} > The reason for that is stats annotation rule for GROUP BY is not considering > partition column into account. For the above query, the generated plan is > attached. As we can see from the plan, GBY is pushed to fact table > (store_sales) but that output of GBY shuffled to perform join instead of > MapJoin conversion. This is because of wrong estimation of cardinality/data > size of GBY on store_sales (Map 1). > What's happening internally is, GBY computes estimated cardinality which in > this case is NDV(ss_item_sk) * NDV(ss_sold_date_sk) = 338901 * 1823 ~= 617M. > This estimate is wrong as ss_sold_date_sk is partition column and estimator > assumes its non-partition column. In this case, not every tasks reads data > from all partitions. We need to take estimated task parallelism into account. > For example: If task parallelism is determined to be 100 the estimate from > GBY should be ~6M which should convert this vertex into map join vertex. -- This message was sent by Atlassian JIRA (v6.3.4#6332)