[ 
https://issues.apache.org/jira/browse/IMPALA-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146608#comment-17146608
 ] 

Qifan Chen commented on IMPALA-2658:
------------------------------------

The enhancement involves both the front and the back end.

In the front end, a second parameter to NDV() is allowed and validated.
In addition, the data type of the intermediate result in the
plan records the correct amount of memory needed. This is assisted
by the inclusion of additional template aggregate function objects
in the built-in database.

In the back end, the current hardcoded precision of 10 is removed. The
HLL algorithm now works with the default or any valid precision value.
The precision is computed from the corresponding scale value
stored in the query plan.
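As a rough sketch of what the precision controls (an illustration of the standard HyperLogLog trade-off, not Impala code): an HLL estimator with precision p uses 2^p registers, and its expected relative error is about 1.04 / sqrt(2^p), so each precision step doubles the intermediate-state size while shrinking the error by a factor of sqrt(2). Assuming one byte per register:

```python
import math

def hll_registers(precision):
    """Number of HLL registers for a given precision: 2^p."""
    return 2 ** precision

def hll_relative_error(precision):
    """Standard HLL error bound: ~1.04 / sqrt(m), where m is the register count."""
    return 1.04 / math.sqrt(hll_registers(precision))

# Valid range proposed in this issue: 4 through 18 inclusive.
for p in (4, 10, 18):
    m = hll_registers(p)
    print(f"precision={p:2d}  registers={m:6d}  ~relative error={hll_relative_error(p):.4f}")
```

At the old hardcoded precision of 10 this gives 1024 registers and roughly 3.25% expected error; precision 18 gives 262144 registers and roughly 0.2%.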

Ran estimation-error tests against a total of 22 different data sets loaded 
into external Impala tables:
- 5 sets with 10 million unique strings
- 5 sets with 10 million unique integers
- 5 sets with 100 million unique strings
- 5 sets with 97 million unique integers
- 1 set with 499 million unique strings
- 1 set with 450 million unique integers

 

> Extend the NDV function to accept a precision
> ---------------------------------------------
>
>                 Key: IMPALA-2658
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2658
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.2.4
>            Reporter: Peter Ebert
>            Assignee: Qifan Chen
>            Priority: Minor
>              Labels: ramp-up
>         Attachments: Comparison of HLL Memory usage, Query Duration and 
> Accuracy.jpg
>
>
> Hyperloglog algorithm used by NDV defaults to a precision of 10.  Being able 
> to set this precision would have two benefits:
> # Lower precisions can speed up performance: a precision of 9 uses half as 
> many registers as 10 (register count grows exponentially with precision) and 
> may be just as accurate depending on the expected cardinality.
> # Higher precision can help with very large cardinalities (the 100 million to 
> billion range) and will typically provide more accurate estimates. Those who 
> are presenting estimates to end users will likely be willing to trade some 
> performance cost for more accuracy, while still outperforming the naive 
> approach by a large margin.
> Propose adding the overloaded function NDV(expression, int precision)
> with an accepted range of 4 to 18 inclusive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
