Hello Qifan Chen, Shant Hovsepian, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16792

to look at the new patch set (#5).

Change subject: IMPALA-10360: Allow simple limit to be treated as sampling hint
......................................................................

IMPALA-10360: Allow simple limit to be treated as sampling hint

As a follow-up to IMPALA-10314, it is sometimes useful to consider
a simple limit as a way to sample from a table if a relevant hint
has been provided. Doing a sample instead of pure limit serves
dual purposes: (a) it still helps with reducing the planning time
since the scan ranges need be computed only for the sample files,
(b) it allows sufficient number of files/rows to be read from
the table such that after applying filter conditions or joins with
another table, the query may still produce the N rows needed for
limit.

This fuctionality is especially useful if the query is against a
view (note that TABLESAMPLE clause cannot be applied to a view).

In this patch, a new table level hint, 'convert_limit_to_sample'
is added. If this hint is attached to a table either in the main
query block or within a view/subquery and simple limit optimization
conditions are satisfied (according to IMPALA-10314), the limit
is converted to a table sample. For example:

 set optimize_simple_limit = true;
 CREATE VIEW v1 as SELECT * FROM T [convert_limit_to_sample]
    WHERE [always_true] <predicate>;
 SELECT * FROM v1 LIMIT 10;

In this case, the limit 10 is converted to a sample of T and the
sampling percent is the greater of 1% or ratio (in percent) of
limit to the estimated row count of the table (after partition
pruning).

Testing:
 - Added a alltypes_date_partition_2 table where the date and
   timestamp values match (this helps with setting the
   'always_true' hint).
 - Added views with 'convert_limit_to_sample' and 'always_true'
   hints and added new tests against the views. Modified a few
   existing tests to reference the new table variant.
 - Added an end-to-end test.

Change-Id: Ife05a5343c913006f7659949b327b63d3f10c04b
---
M fe/src/main/java/org/apache/impala/analysis/CompoundPredicate.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M fe/src/main/java/org/apache/impala/analysis/TableRef.java
M fe/src/main/java/org/apache/impala/planner/HdfsPartitionPruner.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M testdata/bin/compute-table-stats.sh
M testdata/datasets/functional/functional_schema_template.sql
M 
testdata/workloads/functional-planner/queries/PlannerTest/optimize-simple-limit.test
M 
testdata/workloads/functional-query/queries/QueryTest/range-constant-propagation.test
10 files changed, 279 insertions(+), 34 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/92/16792/5
--
To view, visit http://gerrit.cloudera.org:8080/16792
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ife05a5343c913006f7659949b327b63d3f10c04b
Gerrit-Change-Number: 16792
Gerrit-PatchSet: 5
Gerrit-Owner: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Shant Hovsepian <sh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>

Reply via email to