Hello Aman Sinha, Kurt Deschler, David Rorke, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/21377

to look at the new patch set (#5).

Change subject: IMPALA-8042: Assign BETWEEN selectivity for discrete-unique 
column
......................................................................

IMPALA-8042: Assign BETWEEN selectivity for discrete-unique column

Impala frontend can not evaluate BETWEEN/NOT BETWEEN predicate directly.
It needs to transform a BetweenPredicate into a CompoundPredicate
consisting of upper bound and lower bound BinaryPredicate through
BetweenToCompoundRule.java. The BinaryPredicate can then be pushed down
or rewritten into other form by another expression rewrite rule.
However, the selectivity of BetweenPredicate or its derivatives remains
unassigned and often collapses with other unknown selectivity predicates
to have collective selectivity equals Expr.DEFAULT_SELECTIVITY (0.1).

This patch adds a narrow optimization of BetweenPredicate selectivity
when the following criteria are met:

1. The BetweenPredicate is bound to a slot reference of a single column
   of a table.
2. The column type is discrete, such as INTEGER or DATE.
3. The column stats are available.
4. The column is sufficiently unique based on available stats.
5. The BETWEEN/NOT BETWEEN predicate is in good form (lower bound value
   <= upper bound value).
6. The final calculated selectivity is less than or equal to
   Expr.DEFAULT_SELECTIVITY.

If these criteria are unmet, the Planner will revert to the old
behavior, which is letting the selectivity unassigned.

Since this patch only target BetweenPredicate over unique column, the
following query will still have the default scan selectivity (0.1):

select count(*) from tpch.customer c
where c.c_custkey >= 1234 and c.c_custkey <= 2345;

While this equivalent query written with BETWEEN predicate will have
lower scan selectivity:

select count(*) from tpch.customer c
where c.c_custkey between 1234 and 2345;

This patch calculates the BetweenPredicate selectivity during
transformation at BetweenToCompoundRule.java. The selectivity is
piggy-backed into the resulting CompoundPredicate and BinaryPredicate as
betweenSelectivity_ field, separate from the selectivity_ field.
Analyzer.getBoundPredicates() is modified to prioritize the derived
BinaryPredicate over ordinary BinaryPredicate in its return value to
prevent the derived BinaryPredicate from being eliminated by a matching
ordinary BinaryPredicate.

Testing:
- Add table functional_parquet.unique_with_nulls.
- Add FE tests in ExprCardinalityTest#testBetweenSelectivity,
  ExprCardinalityTest#testNotBetweenSelectivity, and
  PlannerTest#testScanCardinality.
- Pass core tests.

Change-Id: Ib349d97349d1ee99788645a66be1b81749684d10
---
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
M fe/src/main/java/org/apache/impala/analysis/CompoundPredicate.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/planner/PlanNode.java
M fe/src/main/java/org/apache/impala/rewrite/BetweenToCompoundRule.java
M fe/src/test/java/org/apache/impala/analysis/ExprCardinalityTest.java
M testdata/bin/compute-table-stats.sh
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-planner/queries/PlannerTest/card-scan.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q05.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q12.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q16.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q20.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q21.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q32.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q37.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q40.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q77.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q80.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q82.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q92.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q94.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q95.test
M 
testdata/workloads/functional-planner/queries/PlannerTest/tpcds_cpu_cost/tpcds-q98.test
27 files changed, 3,908 insertions(+), 3,502 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/77/21377/5
--
To view, visit http://gerrit.cloudera.org:8080/21377
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib349d97349d1ee99788645a66be1b81749684d10
Gerrit-Change-Number: 21377
Gerrit-PatchSet: 5
Gerrit-Owner: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: David Rorke <dro...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kdesc...@cloudera.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>

Reply via email to