[ https://issues.apache.org/jira/browse/IMPALA-7601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paul Rogers updated IMPALA-7601:
--------------------------------
    Summary: Improve cardinality and selectivity estimates  (was: Improve default selectivity values)

> Improve cardinality and selectivity estimates
> ---------------------------------------------
>
>                 Key: IMPALA-7601
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7601
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.12.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>
> Impala makes extensive use of table stats during query planning. For example,
> the NDV (number of distinct values) is used to compute _selectivity_: the
> degree of reduction (also called the _reduction factor_) provided by a
> predicate. For example:
> {noformat}
> SELECT * FROM t WHERE t.a = 10
> {noformat}
> If we know that {{t.a}} has an NDV of 100, then we can predict (given a
> uniform distribution of values) that the above query will pick out one of
> those 100 values, and that the reduction factor is 1/100 = 0.01. Thus the
> selectivity of the predicate {{t.a = 10}} is 0.01.
> For predicate operators other than equality ({{=}}), Impala uses 0.1 for the
> estimated selectivity. In many cases, however, Impala uses no estimate at all.
> There are three observations:
> * Impala does use a-priori selectivity estimates for these operators (NDV
> does not help with inequality, for example).
> * The 0.1 default is a bit too much of a one-size-fits-all estimate.
> * The lack of a selectivity or cardinality estimate can lead to poor plans.
> h4. Problem Summary
> The goal of a planner is to estimate the cardinality of various relations
> (sets of tuples) in order to choose a good plan (which relation to put on
> the build side of a join?) and to make memory estimates (about how much
> memory is needed for a sort?). When planning, even crude estimates are fine,
> as the planner only chooses among a discrete set of options.
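The NDV arithmetic above can be sketched as follows. This is a minimal illustration, not Impala's actual planner code; the class and method names are invented for the example:

```java
// Minimal sketch of NDV-based selectivity, not Impala's actual code.
public class SelectivitySketch {
    // p(c = x) = 1 / NDV(c), given a uniform distribution of values.
    static double equalitySelectivity(long ndv) {
        return ndv > 0 ? 1.0 / ndv : -1;  // -1 means "undefined", as in Impala
    }

    // Estimated output cardinality = input rows * selectivity.
    static long estimateCardinality(long inputRows, double selectivity) {
        return Math.round(inputRows * selectivity);
    }

    public static void main(String[] args) {
        double sel = equalitySelectivity(100);  // 0.01 for NDV = 100
        System.out.println(sel);
        System.out.println(estimateCardinality(1_000_000, sel));
    }
}
```

With NDV = 100, a million-row scan is estimated to return about 10,000 rows; the point is only that the planner needs *some* such number, however crude.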
> The smaller table goes on the build side of a join; it does not matter _how
> much_ smaller one table is than the other.
> It turns out, however, that Impala is quite finicky, refusing to render a
> cardinality estimate if certain information is missing. This means that,
> rather than use a poor estimate, Impala would rather use no estimate at all.
> Rather than produce a crude attempt at a plan, Impala flies blind in these
> cases.
> You can see this in
> [{{PlanNode.computeCombinedSelectivity()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/PlanNode.java].
> The result is that Impala is a bit more strict than classic DB optimizers. If
> stats are present, they are used. If stats are not present, Impala flies
> blind.
> This ticket proposes a number of interrelated changes to add a-priori (before
> observation) defaults for selectivity and NDV based on classic DB practice.
> Also, some of our existing estimates are overly simplified, as discussed
> below.
> h4. A Bit of Math
> Selectivity is a probability: the probability that a Boolean expression
> evaluates to true. That is:
> {noformat}
> selectivity(c = x) = p((c = x) = true) = p(c = x)
> {noformat}
> Where {{c}} is a column and {{x}} is a constant.
> Knowing this helps in picking good selectivity values, and makes the
> following discussion a bit simpler.
> h4. Use NDV to Compute {{p(c != x)}}
> Impala uses NDV to compute {{p(c = x)}}:
> {noformat}
> p(c = x) = 1 / NDV(c)
> {noformat}
> As described in IMPALA-7560, Impala uses a selectivity of 0.1 for
> {{p(c != x)}}. There is, however, a mathematical relationship between these
> operators that we can exploit:
> {noformat}
> p(c != x) = 1 - p(c = x)
> {noformat}
> That is, if we have {{NDV(c)}} and can compute {{p(c = x)}}, we can also
> compute {{p(c != x)}}.
> h4. Refine Default Selectivity Values
> Further, Impala assumes a selectivity of 0.1 for all other relational
> operators except equality.
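The complement relationship above can be sketched as follows. This is illustrative only, with invented names; it is not Impala's actual implementation:

```java
// Sketch: derive p(c != x) from NDV instead of using a flat 0.1 default.
public class NotEqualsSketch {
    static final double DEFAULT_SELECTIVITY = 0.1;  // Impala's current fallback

    // p(c != x) = 1 - p(c = x) = 1 - 1/NDV when stats are available.
    static double notEqualsSelectivity(long ndv) {
        if (ndv <= 0) return DEFAULT_SELECTIVITY;  // no stats: fall back
        return 1.0 - 1.0 / ndv;
    }

    public static void main(String[] args) {
        // With NDV = 100, p(c != x) = 0.99, far above the 0.1 default.
        System.out.println(notEqualsSelectivity(100));
        System.out.println(notEqualsSelectivity(0));
    }
}
```

Note how different the two answers are: a flat 0.1 assumes {{!=}} is highly restrictive, while the NDV-derived 0.99 says it filters out almost nothing.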
> See
> [Expr.java|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/Expr.java]:
> {noformat}
> // To be used where we cannot come up with a
> // better estimate (selectivity_ is -1).
> public static double DEFAULT_SELECTIVITY = 0.1;
> {noformat}
> There is another mathematical relationship between these operators that we
> can exploit to refine some of these estimates:
> {noformat}
> p(c != x) = p(c < x) + p(c > x)
> {noformat}
> As it turns out, we don't need the NDV to compute an inequality. We can
> simply observe that the inequalities roughly divide the data into two sets,
> of which the user picks one. The current default of 0.1 is probably too low,
> and 0.5 is probably too high. The [Ramakrishnan and Gehrke
> book|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
> suggests rule-of-thumb estimates:
> * {{p(c < x)}}, {{p(c <= x)}}, {{p(c > x)}}, {{p(c >= x)}} - 0.3
> * {{p(c BETWEEN x AND y)}} - 0.25
> h4. Default Selectivity for {{p(c = x)}} and {{p(c != x)}} Without Stats
> All this is good. But what happens if statistics are not available for table
> {{t}}? How are we to know the selectivity of the predicate?
> Today, Impala simply refuses to pick a selectivity for {{p(c = x)}} if there
> are no stats. See
> [{{BinaryPredicate.analyzeImpl()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java].
> However, even without stats, Impala continues to use its defaults for all
> other relational operators. While it is true that, without NDV, we can't be
> certain of the selectivity of {{p(c = x)}}, it is also true that we are
> happy to guess for all other operators.
> The [Ramakrishnan and Gehrke
> book|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
> shows that the standard choice is to assume a selectivity of 0.1.
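The rule-of-thumb defaults above could be organized as a simple per-operator lookup. This is a sketch of the proposal with invented names, not Impala code; the 0.3 and 0.25 figures are the R&G values quoted above:

```java
// Sketch: a-priori selectivity defaults per operator, per R&G rules of thumb.
public class DefaultSelectivity {
    static double forOperator(String op) {
        switch (op) {
            case "=":  return 0.1;   // no stats: the standard a-priori guess
            case "!=": return 0.1;   // same as "=" absent other information
            case "<": case "<=":
            case ">": case ">=":
                return 0.3;          // R&G rule of thumb for inequalities
            case "BETWEEN":
                return 0.25;         // R&G rule of thumb for ranges
            default:
                return 0.1;          // Impala's current catch-all default
        }
    }

    public static void main(String[] args) {
        System.out.println(forOperator("<"));
        System.out.println(forOperator("BETWEEN"));
    }
}
```

These numbers apply only when no NDV is available; when stats exist, the NDV-derived formulas above take precedence.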
> Over in the Drill project, DRILL-5254 attempted to work out better estimates
> for other operators based on math and probability. However, the conclusion
> there was that, without NDV and histograms, there is more information in the
> user's intent than in the math. That is, if the user writes
> {{WHERE t.a != 10}}, there is a conditional probability that the user
> believes this to be a highly restrictive predicate, especially on big data.
> So, the reduction factor (which is a probability) is the same for {{=}} and
> {{!=}} in the absence of information. The same reasoning probably led to the
> rule-of-thumb values in the
> [R&G|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
> book.
> So, if no stats are available, use a default number for {{p(c != x)}} such
> as 0.1 or 0.7. (The larger number is probably better.)
> As a result of this change, we will always have a selectivity, even if just
> a rule-of-thumb one. That is, some information is better than none.
> h4. Special Case for Boolean Columns
> There is one special case that may be worth exploiting. If a column is
> Boolean, then we know that it can have at most three values (true, false,
> null). So, even without stats, we can estimate {{NDV(c)}} = 3. This allows
> us to get better estimates for {{p(c = x)}} and {{p(c != x)}} than we'd get
> just using the defaults.
> h4. Separate the NDV=0 and Unknown NDV Cases
> As described in IMPALA-7310, Impala does not currently handle the case of a
> table with stats in which a column contains only NULL values. The stats
> mechanism defines NDV as "number of distinct non-null values", so a table
> full of nulls has NDV = 0. Unfortunately, if no stats are available at all,
> we also have NDV = 0.
> IMPALA-7310 will fix this issue, removing one more case in which Impala does
> not produce a cardinality estimate.
> h4. Rationalize the Planner's and Stats' NDV Definitions
> Related to the above issue is the fact that the planner and the stats
> mechanism use different definitions of NDV:
> * Planner: NDV includes nulls. This is seen when computing the NDV for
> constant expressions.
> * NDV function, and thus stats: NDV does not include nulls.
> Currently, the planner takes the stats mechanism's "without nulls" NDV and
> mixes it with the planner's "with nulls" definition. IMPALA-7310 proposes a
> way to convert the stats definition to the planner's definition.
> h4. Estimate Row Counts
> The above will go a long way toward ensuring that the planner always renders
> a cardinality estimate. But one hole remains: IMPALA-7608 notes that
> cardinality estimates are undefined if stats are unavailable, because we
> don't know row counts. IMPALA-7608 explains a useful, common rule of thumb.
> Add that, and we'd have covered all situations in which Impala currently
> refuses to produce a cardinality estimate.
> h4. Handle Large Cardinality
> IMPALA-7604 notes that we use a {{long}} to hold a cardinality estimate.
> This can overflow, causing incorrect estimates. As we add the above, we will
> put more confidence in the cardinality estimates. We need to fix the
> overflow, as it can undermine the value of the cardinality estimates by
> producing wrong numbers in extreme cases.
> h4. Remove Special Handling for Undefined Selectivity and Cardinality
> Once all of the above are available, every plan node will compute a
> cardinality. The existing code to handle cardinality = -1 (undefined) can be
> removed. Tests can be updated to enforce that all plans have an estimated
> cardinality.
> h4. References
> * [Cost Estimation
> Overview|https://courses.cs.washington.edu/courses/cse544/09wi/lecture-notes/lecture8/lecture8.pdf]
> * [Reduction
> factors|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
> from the classic Ramakrishnan and Gehrke book.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)