[ https://issues.apache.org/jira/browse/IMPALA-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146608#comment-17146608 ]
Qifan Chen edited comment on IMPALA-2658 at 6/26/20, 7:58 PM:
--------------------------------------------------------------
The enhancement involves both the front end and the back end. In the front end, a second parameter to NDV() is allowed and verified. In addition, the data type of the intermediate result in the plan records the correct amount of memory needed; this is assisted by the inclusion of additional template aggregate function objects in the built-in database. In the back end, the previously hardcoded precision of 10 is removed. The HLL algorithm now works with the default or any valid precision value, where the precision is computed from the corresponding scale value stored in the query plan.

Ran estimation-error tests against a total of 22 different data sets loaded into external Impala tables:
- 5 sets with 10 million unique strings
- 5 sets with 10 million unique integers
- 5 sets with 100 million unique strings
- 5 sets with 97 million unique integers
- 1 set with 499 million unique strings
- 1 set with 450 million unique integers

A follow-up task is scheduled to run additional tests against large tables.
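The trade-off behind the now-configurable precision is the standard HyperLogLog one: a precision of p uses 2^p registers, and the expected relative error shrinks as roughly 1.04 / sqrt(2^p). A minimal sketch of that relationship (illustrative Python only; `hll_footprint` is a hypothetical helper, not an Impala API):

```python
import math

def hll_footprint(precision: int) -> dict:
    """Register count and expected relative error for an HLL precision.

    Uses the standard HyperLogLog error bound 1.04 / sqrt(m), m = 2^p.
    The 4..18 range mirrors the range proposed in this issue.
    """
    if not 4 <= precision <= 18:
        raise ValueError("precision must be between 4 and 18 inclusive")
    m = 1 << precision  # number of registers is exponential in the precision
    return {"registers": m, "rel_error": 1.04 / math.sqrt(m)}

# A precision of 9 uses half the registers of the old hardcoded default of 10.
assert hll_footprint(9)["registers"] * 2 == hll_footprint(10)["registers"]
```

For the default precision of 10 this gives 1024 registers and an expected relative error of about 3.25%, which is consistent with the accuracy/memory trade-off discussed in the issue below.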
> Extend the NDV function to accept a precision
> ---------------------------------------------
>
>                 Key: IMPALA-2658
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2658
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.2.4
>            Reporter: Peter Ebert
>            Assignee: Qifan Chen
>            Priority: Minor
>              Labels: ramp-up
>             Fix For: Impala 4.0
>
>         Attachments: Comparison of HLL Memory usage, Query Duration and Accuracy.jpg
>
> The HyperLogLog algorithm used by NDV defaults to a precision of 10. Being able to set this precision would have two benefits:
> # Lower precisions can speed up performance: a precision of 9 has half the number of registers of 10 (the register count is exponential in the precision) and may be just as accurate, depending on the expected cardinality.
> # Higher precisions can help with very large cardinalities (the 100 million to billion range) and will typically provide more accurate estimates. Those presenting estimates to end users will likely be willing to trade some performance for accuracy, while still outperforming the naive approach by a large margin.
>
> Propose adding the overloaded function NDV(expression, int precision), with an accepted precision range between 4 and 18 inclusive.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
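For intuition about what the proposed precision parameter controls, the algorithm behind NDV can be sketched as a toy HyperLogLog with a configurable precision. This is an illustrative sketch, not Impala's backend implementation; the class name, hash choice, and methods are all assumptions made for the example.

```python
import hashlib
import math

class ToyHLL:
    """Toy HyperLogLog with a configurable precision (4..18 inclusive)."""

    def __init__(self, precision: int = 10):
        if not 4 <= precision <= 18:
            raise ValueError("precision must be between 4 and 18 inclusive")
        self.p = precision
        self.m = 1 << precision          # register count doubles per +1 precision
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        # 64-bit hash; SHA-1 is used here only for a deterministic example.
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits pick a register
        w = h & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        rank = (64 - self.p) - w.bit_length() + 1   # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias correction for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        if raw <= 2.5 * self.m:                     # small-range (linear counting) correction
            zeros = self.registers.count(0)
            if zeros:
                return self.m * math.log(self.m / zeros)
        return raw
```

With the old hardcoded precision of 10 the expected relative error is roughly 3%; raising the precision tightens the estimate at the cost of a larger intermediate result, which is exactly the memory accounting the front-end change above records in the plan.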