[ 
https://issues.apache.org/jira/browse/IMPALA-8217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned IMPALA-8217:
-----------------------------------

    Assignee:     (was: Paul Rogers)

> Remove scan operator exponential backoff in cardinality calcs
> -------------------------------------------------------------
>
>                 Key: IMPALA-8217
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8217
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 3.1.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> The planner's scan nodes use an exponential back-off to compute the joint 
> filter selectivity:
> {code:java}
>     double result = 1.0;
>     for (int i = 0; i < selectivities.size(); ++i) {
>       // Exponential backoff for each selectivity multiplied into the final 
> result.
>       result *= Math.pow(selectivities.get(i), 1.0 / (double) (i + 1));
>     }
> {code}
> The reasoning is:
> {code:java}
>   /*
>    * 2. Two selectivities, whether known or unknown, could be correlated. 
> Assuming
>    *    independence can lead to significant underestimation.
>    *
>    * The second issue is addressed by an exponential backoff when multiplying 
> each
>    * additional selectivity into the final result.
>    */
> {code}
> The logic makes some kind of sense, especially since Impala calculates 
> selectivity only for one kind of predicate: {{col = const}}, and ignores 
> selectivity for all others. (Actually, replaces all others with a single 
> combined selectivity estimate of 0.1.}}
> The problem, however, is that this violates the commutative and linear models 
> that underly relational theory. In general, relational theory says it does 
> not matter if we join then filter, or filter, then join. If we have two 
> filters, f1 and f2, it does not matter if we apply f1 in the scan and f2 
> after the join: the resulting result set is the same. (Performance differs, 
> but we ignore that when computing cardinality.)
> With exponential back-off, this no longer applies. It becomes difficult, say, 
> to address the issue in IMPALA-XXXX in which we need to apply filter 
> selectivity at the scan level, then remove it again at the join level. The 
> reason is that the cardinality applied at the scan level may have been backed 
> off so simply using the expression's report of its cardinality may not be the 
> value used by the scan.
> Also, oddly, the same expression could apply different selectivity in 
> different scans. The code sorts selectivities in ascending order. So, in scan 
> 1, filter f1 might apply its full selectivity, while in scan 2 (assuming a 
> correlated filter), it might apply the square or cube of the selectivity.
> h4. Solution
> The solution here is simply to:
> * Compute the selectivity for all predicates so we don't have to fudge.
> * Remove correlated predicates from join calcs (See IMPALA-XXXX).
> * Remove exponential back-off calculations so that filter selectivity can be 
> properly decomposed for the above task.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to