[ 
https://issues.apache.org/jira/browse/IMPALA-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned IMPALA-7944:
-----------------------------------

    Assignee:     (was: Paul Rogers)

> count(*) correctly has NDV=1 via being labeled as constant
> ----------------------------------------------------------
>
>                 Key: IMPALA-7944
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7944
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> The {{count\(*)}} function has an NDV of 1: the function always returns a 
> single value. 
> Presently, {{count\(*)}} gets its NDV of 1 only via a broken process, while 
> {{sum(id)}} incorrectly gets the NDV of {{id}} when it should be 1.
> This is important because it tells us that the query:
> {code:sql}
> SELECT COUNT(*) FROM foo
> {code}
> returns just one row. All good.
> In the analyzer, we set a value of NDV=1 via an incorrect process: by 
> labeling {{count\(*)}} as constant:
>  * For historical reasons, NDV calculations occur before a node is analyzed.
>  * We use the default NDV calc: if the node is constant, set NDV = 1, else 
> compute it.
>  * Since the function node for {{count\(*)}} is not analyzed, we determine 
> constant-ness from an inspection.
>  * All checks for non-constantness fail, leaving the final check: a function 
> is constant if either a) it has no arguments, or b) all its arguments are 
> constant.
>  * Since {{count\(*)}} has no expression arguments, and is not marked as 
> non-deterministic, we infer it must be constant.
>  * Therefore, its NDV is set to 1.
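> The steps above can be sketched in a few lines. This is a simplified model, 
> not Impala's code: the {{FnCall}} and {{Arg}} classes, the {{looksConstant}} 
> and {{ndv}} methods, and the "largest argument NDV" fallback are all 
> assumptions made for illustration.

```java
import java.util.Collections;
import java.util.List;

public class PreAnalysisNdvSketch {
    /** Hypothetical argument: a literal (constant) or a column with a known NDV. */
    static final class Arg {
        final String text;
        final boolean constant;
        final long ndv;

        Arg(String text, boolean constant, long ndv) {
            this.text = text;
            this.constant = constant;
            this.ndv = ndv;
        }
    }

    /** Hypothetical stand-in for an unanalyzed function-call node. */
    static final class FnCall {
        final String name;
        final List<Arg> args;
        final boolean nonDeterministic;

        FnCall(String name, List<Arg> args, boolean nonDeterministic) {
            this.name = name;
            this.args = args;
            this.nonDeterministic = nonDeterministic;
        }
    }

    /** Const-ness by inspection: no arguments, or all arguments constant. */
    static boolean looksConstant(FnCall fn) {
        if (fn.nonDeterministic) {
            return false;
        }
        // allMatch on an empty list is true, so a zero-argument call passes.
        return fn.args.stream().allMatch(a -> a.constant);
    }

    /** Default NDV rule: 1 if the node looks constant, else compute it. */
    static long ndv(FnCall fn) {
        if (looksConstant(fn)) {
            return 1;
        }
        // Assumed fallback: take the largest NDV among the arguments.
        return fn.args.stream().mapToLong(a -> a.ndv).max().orElse(1);
    }

    public static void main(String[] args) {
        // count(*) carries no expression arguments in this model.
        FnCall countStar = new FnCall("count", Collections.<Arg>emptyList(), false);
        FnCall sumId = new FnCall(
            "sum", Collections.singletonList(new Arg("id", false, 7300)), false);

        System.out.println("count(*) NDV = " + ndv(countStar)); // 1, but only by accident
        System.out.println("sum(id)  NDV = " + ndv(sumId));     // 7300, should be 1
    }
}
```

> In this model {{count\(*)}} gets NDV=1 only because an empty argument list 
> passes the "all arguments are constant" test, while {{sum(id)}} inherits the 
> NDV of {{id}}.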
> This is, of course, highly unstable for multiple reasons:
>  * NDV calculations are done before the node is analyzed. This means, NDV 
> calculations for a {{SlotRef}} would fail because the ref has not yet been 
> resolved to a column. (The {{SlotRef}} has special code to work around this 
> fact.)
>  * The "treat zero-argument functions as constants and so use NDV=1" rule 
> works for {{count\(*)}}, but not for {{count(c)}}, nor for {{sum(c)}}, both of 
> which should have NDV=1.
>  * {{count\(*)}} is not really a constant; its NDV=1 setting should not 
> rely on (benignly) assuming it is.
>  * The const-ness seen by the NDV check is temporary; once the node is 
> analyzed, it is correctly marked as non-const. So, the calcs rely on one path 
> saying the function is const, and another path saying it is not const.
> The current NDV for a {{sum(id)}} function is the NDV of {{id}}, which is 
> 7300 in this particular query. The NDV of {{sum(id)}} should be 1.
> This should be cleaned up to provide a more reliable, understandable way of 
> achieving the goal of NDV=1.
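> One possible shape for such a cleanup is sketched below. It is not the 
> actual fix: the {{Fn}} class, the {{Kind}} enum, and the {{isConstant}} and 
> {{ndv}} methods are hypothetical names, and treating a scalar function with 
> no column argument as constant is a simplifying assumption. The idea is 
> to give aggregates NDV=1 through an explicit rule, following the statement 
> above that {{count(c)}} and {{sum(c)}} should have NDV=1, instead of through 
> const-ness.

```java
public class ExplicitAggNdvSketch {
    /** Hypothetical classification of a function; not an Impala type. */
    enum Kind { AGGREGATE, SCALAR }

    /** Hypothetical function node: name, kind, and NDV of its column argument (0 if none). */
    static final class Fn {
        final String name;
        final Kind kind;
        final long argNdv;

        Fn(String name, Kind kind, long argNdv) {
            this.name = name;
            this.kind = kind;
            this.argNdv = argNdv;
        }
    }

    /** Aggregate functions are never constant, before or after analysis. */
    static boolean isConstant(Fn fn) {
        // Simplification: a scalar function with no column argument is constant.
        return fn.kind == Kind.SCALAR && fn.argNdv == 0;
    }

    /** NDV rule that does not depend on const-ness for aggregates. */
    static long ndv(Fn fn) {
        if (fn.kind == Kind.AGGREGATE) {
            return 1; // explicit rule: covers count(*), count(c) and sum(c) alike
        }
        if (isConstant(fn)) {
            return 1;
        }
        return fn.argNdv;
    }

    public static void main(String[] args) {
        Fn[] fns = {
            new Fn("count(*)", Kind.AGGREGATE, 0),
            new Fn("count(c)", Kind.AGGREGATE, 7300),
            new Fn("sum(id)", Kind.AGGREGATE, 7300),
            new Fn("upper(name)", Kind.SCALAR, 50),
        };
        for (Fn fn : fns) {
            System.out.println(fn.name + ": constant=" + isConstant(fn) + ", NDV=" + ndv(fn));
        }
    }
}
```

> With this rule the const-ness answer is the same on every path (aggregates 
> are never constant), and the NDV of an aggregate no longer depends on how many 
> arguments it has.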
> As it turns out, this seems to have been a known issue in the code:
> {code:java}
>     // TODO: we can't correctly determine const-ness before analyzing 'fn_'. We should
>     // rework logic so that we do not call this function on unanalyzed exprs.
>     // Aggregate functions are never constant.
> {code}
> This defect affects memory estimates. With the correct NDV estimates, several 
> {{PlannerTest}} cases change to include a lower memory reservation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

