[
https://issues.apache.org/jira/browse/HIVE-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mostafa Mokhtar updated HIVE-7913:
----------------------------------
Assignee: Laljo John Pullokkaran
> Simplify predicates for CBO
> ---------------------------
>
> Key: HIVE-7913
> URL: https://issues.apache.org/jira/browse/HIVE-7913
> Project: Hive
> Issue Type: Bug
> Components: CBO
> Affects Versions: 0.13.1
> Reporter: Mostafa Mokhtar
> Assignee: Laljo John Pullokkaran
> Fix For: 0.14.0
>
>
> I noticed that the estimate number of rows in Map joins is higher after the
> join than before the join that is with column stats fetch ON or OFF.
> TPC-DS Q55 was a good example for that, the issue is that the current
> statistics provide us enough information that we can estimate with strong
> confidence that the joins are one to many and not many to many.
> Joining store_sales x item on ss_item_sk = i_item_sk, we know that the NDV,
> min and max values for both join columns match while the row counts are
> different this pattern indicates a PK/FK relationship between store_sales and
> item.
> Yet when a filter is applied on item and reduces the number of rows from 462K
> to 7K we estimate a many to many join between the filtered item and
> store_sales and as a result the estimate number of rows coming out of the
> join is off by several orders of magnitude.
> Available information from the stats
> {code}
> Table Join column NDV from describe NDV actual
> min max
> item i_item_sk 439,501 462,000
> 1 462,000
> date_dim d_date_sk 65,332 73,049
> 2,415,022 2,488,070
> store_sales ss_item_sk 439,501 462,000
> 1 462,000
> store_sales ss_sold_date_sk 2,226 1,823
> 2,450,816 2,452,642
> {code}
> Same thing applies to store_sales and date_dim but with a caveat that the NDV
> , min and max values don't match where date_dim has a bigger domain and
> accordingly a higher NDV count.
> For joining store_sales and item on on ss_item_sk = i_item_sk since both
> columns have the same NDV, min and max values we can safely conclude that
> selectivity on item will translate to similar selectivity on store_sales.
> This is not the case for joining store_sales and date_dim on ss_sold_date_sk
> = d_date_sk since the domain of d_date_sk is much bigger than that of
> ss_sold_date_sk, differences in domain need to be taken into account when
> inferring selectivity onto store_sales.
--
This message was sent by Atlassian JIRA
(v6.2#6252)