[ 
https://issues.apache.org/jira/browse/IMPALA-8048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltán Borók-Nagy updated IMPALA-8048:
--------------------------------------
    Epic Link: IMPALA-12674

> Improve join cardinality estimation: urn model, NDV tracking, etc.
> ------------------------------------------------------------------
>
>                 Key: IMPALA-8048
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8048
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> Work is underway on a number of JIRA tickets to improve cardinality 
> estimates. That work is constrained by the possible need to back-port to 
> prior releases. As a result, the changes are made within the existing context 
> to minimize the impact.
> The current model makes a number of naive assumptions, however, that should 
> be addressed in a second batch of changes which will entail a wider code 
> impact.
> h4. Adopt the Urn Model for NDV Estimation.
> Suppose we have a table alumni(name, sex, class) with values such as:
> {noformat}
> John Smith, M, 2008
> Jane Doe, F, 1993
> ...
> {noformat}
> We have 50 years of data, 1000 rows per year, or 50K rows. We have these 
> stats:
> {noformat}
> |alumni| = 50K
> |name| = 49K
> |sex| = 2
> |class| = 50
> {noformat}
> We have the following query which fills in the graduation date for each class:
> {code:sql}
> select * from alumni, grad_dates where sex='F' where alumni.class = 
> grad_dates.class
> {code}
> Focusing just on the alumni table, how many classes will be available to 
> match? That is, what is {{|class'|}}, the NDV of the name field after 
> accounting for the affect of the predicate {{sex='F'}}.
> Today we work it out with a linear model as follows:
> {code}
> sel(sex = 'F') = 1/|F| = 1/2 = 0.5
> |sex'| = |sex| * sel(sex = 'F') = 2 * 0.5 = 1
> |class'| = |class| * sel(sex = 'F') = 50 * 0.5 = 25
> {code}
> The math works for the {{sex}} field: the correct adjusted NDV is 1.
> What about for {{class}}? Since the predicate eliminated half the rows, it 
> eliminated half the class values. But, this can't be right. Surely women 
> graduated in all classes. What went wrong?
> The problem is the linear assumption. As shown in the [SwamiI and 
> Schiefer|https://pdfs.semanticscholar.org/2735/672262940a23cfce8d4cc1e25c0191de7da7.pdf]
>  paper, Section 5, the correct estimation technique is the urn model. See the 
> paper for details. Using that model:
> {noformat}
> |x'| = (1 - (1 - 1/|x|)^|T')
> |alumni'| = |alumni| * sel(sex = 'F') = 50K * .5 = 25K
> |class'| = |class| * (1 - (1 - 1/50) ^ 25K) = 50
> {noformat}
> That is, as the cardinality of the selected table grows larger, the 
> probability reaches 1 that other, non-correlated values will still appear. 
> This, though we remove half the rows, all the classes are still represented.
> h4. Per-Tuple Column NDV Tracking
> At present, after the current round of changes, we use a linear model to 
> estimate column NDV after filtering, and use the same model for all columns. 
> If we adopt the urn model, then we must treat columns separately. In the 
> above, we do *not* want to apply the urn model to the {{sex}} column. Why? We 
> already know its cardinality from the filter predicate. Don't want to replace 
> it with an estimated urn-model value. This problem is more acute if you 
> consider a range predicate, such as those used on partitions: {{class > 
> 2009}}.
> To make the above work, we have to track NDV per column. That is, the scan 
> node must provide a list of columns and their NDVs after scanning. Columns 
> mentioned in a predicate have their NDVs estimated from selectivity. All 
> other columns have their NDVs estimated from the urn model. (There are 
> several ways to implement this; the point is that some columns must be 
> singled out for special treatment.)
> h4. Proper Join-to-Table Join Column NDVs
> The NDV adjustment model says that, to compute the join cardinality, we need 
> the adjusted column cardinality (NDV). When joining one table to another, it 
> is clear how to adjust the column NDVs for each table: each is done according 
> to the rules spelled out above.
> A complexity arises, however, when we want to join three tables: we have ((A 
> ⋈ B) ⋈ C). How do we adjust the NDVs for the columns created by the (A ⋈ B) 
> join? If we simply adjust the NDV of table columns using a common selectivity 
> (as done in the simple linear model), then we are correct for the columns 
> from one table, but wrong for columns from the other. Why? The two table had 
> different selectivities applied, we can't reduce them to a common number. 
> The solution is the per-column adjusted NDV tracking: we'd know to apply one 
> set of adjustments for columns from the left table, another for the right.
> This requires additional data structures in each plan node.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to