fifteencai has uploaded a new patch set (#16). ( http://gerrit.cloudera.org:8080/17306 )
Change subject: IMPALA-10445: Adjust NDV's scale with query option ...................................................................... IMPALA-10445: Adjust NDV's scale with query option This is a new way to control NDV's scale. Since IMPALA-2658, we can trade memory for more accurate estimation by setting larger `scale` in SQL function NDV(<expr>, <scale>). However the use of larger NDV scale requires the modification of SQL queries which may not be practical in certain applications: - Firstly, SQL writers are reluctant to lower that scale. They prone to fill up the scale, which will make the cluster unstable. Especially when there are `group by`s with high cardinalities. So it is wiser to let cluster admin other than sql writer choose appropriate scale. - Secondly, In some application scenarios, queries are stored in DBs. In a BI system, for example, rewriting thousands of SQLs is risky. In this commit, we introduced a new Query Option `DEFAULT_NDV_SCALE` with the following semantics: 1. The allowed value is in the range [1..10]; 2. Previously, the scale used in NDV(<expr>) functions was fixed at 2. Now the scale is provided by the newly added query options. 3. It does not influence the NDV scale for SQL function NDV(<expr>, <scale>) in which the NDV scale is provided by the 2nd argument <scale>. We also refactored method `Analyze` to make sure APPX_COUNT_DISTINCT can work with this query option. After this, cluster admins can substitute `count(distinct <expr>)` with `ndv(<expr>, scale)`. Implementation details: - The default value of DEFAULT_NDV_SCALE is 2, so we won't change the default ndv behavior. - We port `CountDistinctToNdv` transform logic from `SelectStmt.analyze()` to `ExprRewriter`, making it compatible with further rewrite rules. - The newly added rewrite rule `DefaultNdvScaleRule` is applied after `CountDistinctToNdvRule`. Usage: To set a default ndv scale: ``` SET DEFAULT_NDV_SCALE = 10; ``` To unset: ``` SET DEFAULT_NDV_SCALE = 2; ``` Here are test results of a typical workload (cardinality=40,090,650): +====================================================================+ | Metric | Count Distinct | NDV2 | NDV5 | NDV10 | +--------------------------------------------------------------------+ | Memory(GB) | 3.83 | 1.84 | 1.85 | 1.89 | | Duration(s) | 182.89 | 30.22 | 29.72 | 29.24 | | ErrorRate | 0% | 1.8% | 1.17% | 0.06% | +====================================================================+ Testing: 1) Added 3 unit test cases in `ExprRewriteRulesTest`. 2) Added 5 unit test cases in `ExprRewriterTest`. 3) Ran all front-end unit test, passed. 4) Added a new query-option test. Change-Id: I1669858a6e8252e167b464586e8d0b6cb0d0bd50 --- M be/src/service/query-options-test.cc M be/src/service/query-options.cc M be/src/service/query-options.h M common/thrift/ImpalaService.thrift M common/thrift/Query.thrift M fe/src/main/java/org/apache/impala/analysis/Analyzer.java M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java A fe/src/main/java/org/apache/impala/rewrite/CountDistinctToNdvRule.java A fe/src/main/java/org/apache/impala/rewrite/DefaultNdvScaleRule.java M fe/src/test/java/org/apache/impala/analysis/ExprRewriteRulesTest.java M fe/src/test/java/org/apache/impala/analysis/ExprRewriterTest.java 11 files changed, 314 insertions(+), 34 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/06/17306/16 -- To view, visit http://gerrit.cloudera.org:8080/17306 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I1669858a6e8252e167b464586e8d0b6cb0d0bd50 Gerrit-Change-Number: 17306 Gerrit-PatchSet: 16 Gerrit-Owner: fifteencai <fifteen...@tencent.com> Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: fifteencai <fifteen...@tencent.com>