[ https://issues.apache.org/jira/browse/IMPALA-7601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622709#comment-16622709 ]

Paul Rogers commented on IMPALA-7601:
-------------------------------------

Please see [~tarmstr...@cloudera.com]'s comment in IMPALA-7310 for a bit of 
history.

To address the points there:

1. Having defaults makes testing *easier* because tests can be written using 
the defaults. Such tests focus on verifying that values propagate up the plan 
tree as expected, and they need no external data.
2. Tests with external data (real stats) can then focus on whether the numbers 
are applied where expected, rather than using the external data to validate 
planning calculations.
3. Queries perform adequately even in the absence of stats. This takes 
pressure off needing to recompute stats over and over.
4. Reduction of complexity is a good thing. Having adequate unit tests to 
cover all paths is better. This then allows us to implement (and test) both 
the default and with-stats code paths.

Along these lines, the changes proposed here only make sense if:

1. Unit tests are created (or available) to verify the calculations work as 
expected.
2. A mechanism is available to assess the impact of the changes on the plans 
that will result.

These are non-trivial efforts, so this is more of a longer-term suggestion than 
an immediate fix.

> Define a-priori selectivity and NDV values
> ------------------------------------------
>
>                 Key: IMPALA-7601
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7601
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 2.12.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>
> Impala makes extensive use of table stats during query planning. For example, 
> the NDV (number of distinct values) is used to compute _selectivity_, the 
> degree of reduction (also called the _reduction factor_) provided by a 
> predicate. For example:
> {noformat}
> SELECT * FROM t WHERE t.a = 10
> {noformat}
> If we know that {{t.a}} has an NDV=100, then we can predict (given a uniform 
> distribution of values) that the above query will pick out one of these 100 
> values, and that the reduction factor is 1/100 = 0.01. Thus the selectivity 
> of the predicate {{t.a = 10}} is 0.01.
> h4. Selectivity Without Stats
> All this is good. But, what happens if statistics are not available for table 
> {{t}}? How are we to know the selectivity of the predicate?
> It could be that {{t.a}} contains nothing but the value 10, so there is no 
> reduction at all. It could be that {{t.a}} contains no values of 10, so the 
> reduction is total: no rows are returned. The classic solution is to assume 
> that the user put the predicate in the query for the purpose of subsetting 
> the data. The classic value, shown in the [Ramakrishnan and Gehrke 
> book|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
>  is to assume a 90% reduction, or a selectivity of 0.1. Indeed this value is 
> seen in 
> [Impala|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/Expr.java]:
> {noformat}
>   // To be used where we cannot come up with a
>   // better estimate (selectivity_ is -1).
>   public static double DEFAULT_SELECTIVITY = 0.1;
> {noformat}
> As it turns out, however, the actual implementation is a bit more complex, as 
> hinted at by the above comment. Impala relies on stats. Given stats, 
> specifically the NDV, we compute selectivity as:
> {noformat}
> selectivity = 1 / ndv
> {noformat}
> What happens if no NDV is available? In that case, we skip the 
> selectivity calculation and leave it at a special default value of -1.0, 
> which seems to indicate "unknown". See 
> [{{BinaryPredicate.analyzeImpl()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java].
> Later, when we use the selectivity to calculate reduction factors, we simply 
> skip any node with a selectivity of -1. You can see that in 
> [{{PlanNode.computeCombinedSelectivity()}}|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/PlanNode.java].
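> The effect can be sketched roughly as follows (a simplified illustration, 
> not the actual Impala code):
> {noformat}
>   // Combine per-predicate selectivities; the "unknown" marker (-1) is
>   // skipped, so with no stats anywhere the result stays 1.0, i.e.
>   // predicates are assumed to have no effect.
>   static double combinedSelectivity(double[] selectivities) {
>     double combined = 1.0;
>     for (double sel : selectivities) {
>       if (sel == -1) continue;  // unknown: contributes nothing
>       combined *= sel;
>     }
>     return combined;
>   }
> {noformat}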
> The result is that Impala is a bit more strict than classic DB optimizers. If 
> stats are present, they are used. If stats are not present, Impala assumes 
> that predicates have no effect.
> This ticket proposes a number of interrelated changes to add a-priori (before 
> observation) defaults for selectivity and NDV based on classic DB practice.
> h4. Proposal: Add A-priori Selectivity Values
> But, we said earlier that users include a predicate because they expect it to 
> do something. So, we are actually discarding the (albeit vague) information 
> that the user provided.
> This is why many optimizers go ahead and assume a default 0.1 reduction 
> factor for equality predicates even if no stats are available. The first 
> proposal of this ticket is to use that default reduction factor even if no 
> stats are present. This says that some reduction will occur, but, to be 
> conservative, we assume not a huge reduction.
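> A minimal sketch of the proposed fallback ({{selectivity}} here is a 
> hypothetical helper, not existing Impala code):
> {noformat}
>   // Proposed: never leave selectivity "unknown"; fall back to the
>   // classic default (0.1) when no NDV stat is available.
>   static double selectivity(long ndv, boolean hasStats) {
>     if (hasStats && ndv > 0) return 1.0 / ndv;
>     return 0.1;  // Expr.DEFAULT_SELECTIVITY
>   }
> {noformat}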
> h4. Proposal: Add Selectivity for All Predicate Operators
> At present, Impala computes reduction factors only for equality predicates. (See 
> IMPALA-7560.) The book suggests rule-of-thumb estimates for other operators:
> * {{!=}} - 0.1
> * {{<}}, {{<=}}, {{>}}, {{>=}} - 0.3
> * {{BETWEEN}} - 0.25
> Over in the Drill project, DRILL-5254 attempted to work out better estimates 
> based on math and probability. However, the conclusion there was that, 
> without NDV and histograms, there is more information in the user's intent 
> than in the math. That is, if the user writes {{WHERE t.a != 10}}, there is 
> a conditional probability (given that the user wrote the predicate at all) 
> that the user believes this is a highly restrictive predicate, especially on 
> big data. So, the reduction factor 
> (which is a probability) is the same for {{=}} and {{!=}} in the absence of 
> information. The same reasoning probably led to the rule-of-thumb values in 
> the 
> [R&G|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
>  book.
> So, the second proposal is that Impala use the classic numbers for other 
> operators when no stats are available.
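> In code form, the proposal might look like the following sketch (the 
> operator enum and method are illustrative, not Impala's actual types):
> {noformat}
>   enum Op { EQ, NE, LT, LE, GT, GE, BETWEEN }
> 
>   // A-priori selectivities in the absence of stats (R&G rule-of-thumb).
>   static double defaultSelectivity(Op op) {
>     switch (op) {
>       case EQ:
>       case NE: return 0.1;
>       case LT:
>       case LE:
>       case GT:
>       case GE: return 0.3;
>       case BETWEEN: return 0.25;
>       default: return 0.1;
>     }
>   }
> {noformat}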
> h4. Proposal: Use Stats-Based Selectivity Estimates When Available
> If stats are available, then we can "run the numbers" and get better 
> estimates:
> * {{p(a = x)}} = 1 / NDV
> * {{p(a != x)}} = {{1 - p(a = x)}} = 1 - 1 / NDV
> So, the third proposal is to use the above math for the {{!=}} case.
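> For example, with NDV = 100 (numbers purely illustrative):
> {noformat}
>   p(a = x)  = 1 / 100     = 0.01
>   p(a != x) = 1 - 1 / 100 = 0.99
> {noformat}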
> h4. Proposal: Use Defaults when Stats are Unavailable or Not Useful
> As it turns out, without histograms, the NDV by itself gives us no 
> information by which to compute reduction for an inequality. So the classic 
> a-priori defaults are still the best guess.
> This leads to the fourth proposal: use the classic defaults unless 
> stats provide additional information to use instead. Said another way, 
> retire the current selectivity = -1 ("undefined") concept: we will always 
> have a selectivity, even if just a rule-of-thumb one. That is, some 
> information is better than none.
> h4. Proposal: Define an A-priori NDV Estimate
> Impala uses NDV for a number of purposes. If no stats are available, Impala 
> assumes no NDV. Given the assumed reduction factor, we can guess an NDV = 
> 1/0.1 = 10. Depending on where NDV is used (need to research this), we might 
> want to choose some other value, perhaps 100. So, the fifth proposal is to 
> assume an a-priori (before observation) NDV value, with the actual default 
> value TBD.
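> As a sketch, the constant might be derived from the default selectivity 
> ({{DEFAULT_NDV}} is hypothetical, and the value is TBD as noted above):
> {noformat}
>   // Hypothetical a-priori NDV, derived from the default selectivity:
>   // 1 / 0.1 = 10.
>   public static final long DEFAULT_NDV = Math.round(1.0 / 0.1);
> {noformat}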
> h4. Proposal: Separate the NDV=0 and Unknown NDV cases
> As described in IMPALA-7310, Impala does not currently handle the case of a 
> table with stats, but a column in the table contains only NULL values. The 
> stats mechanism defines NDV as "number of distinct non-null values". So, a 
> table full of nulls has NDV = 0. Unfortunately, if no stats are available at 
> all, we also have NDV = 0.
> So, the sixth proposal is to separate these cases. If we can tell that a 
> column is a nullable type, assume that the actual NDV is (stats NDV + 1). 
> (This is, in fact, the NDV calculation used by Impala itself when computing 
> NDV for expressions; see 
> [{{ExprNdvTest}}|https://github.com/apache/impala/blob/master/fe/src/test/java/org/apache/impala/analysis/ExprNdvTest.java].)
> To avoid impact on existing tests, apply this rule only when it matters: 
> when the NDV value is, say, less than 10. (No reason to count the null 
> value if NDV is large.) So, a column with all nulls will have an 
> (adjusted) NDV = 1.
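> A sketch of the proposed adjustment ({{adjustedNdv}} is a hypothetical 
> helper, not existing Impala code; the threshold of 10 is the example value 
> from above):
> {noformat}
>   // Proposed: count NULL as a distinct value for nullable columns, but
>   // only when the stats NDV is small enough for it to matter.
>   static long adjustedNdv(long statsNdv, boolean nullable) {
>     // A column of all NULLs has statsNdv = 0 and gets adjusted NDV = 1.
>     if (nullable && statsNdv < 10) return statsNdv + 1;
>     return statsNdv;
>   }
> {noformat}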
> Then, as noted above, if no stats exist, use the a-priori NDV.
> h4. References
> * [Cost Estimation 
> Overview|https://courses.cs.washington.edu/courses/cse544/09wi/lecture-notes/lecture8/lecture8.pdf]
> * [Reduction 
> factors|http://pages.cs.wisc.edu/~dbbook/openAccess/Minibase/optimizer/costformula.html]
>  from the classic Ramakrishnan and Gehrke book.


