Qifan Chen has uploaded a new patch set (#24). ( 
http://gerrit.cloudera.org:8080/17478 )

Change subject: IMPALA-10709: Min/max filters should be enabled for joins on 
sorted columns in Parquet tables
......................................................................

IMPALA-10709: Min/max filters should be enabled for joins on sorted columns in 
Parquet tables

This patch enables min/max filters for equi-joins on sort by
columns in a Parquet table created by Impala. This takes advantage
of Impala sorting the min/max values in the column index of each data
file for the table. When the table has multiple sort by columns,
only the leading column is assigned a min/max filter. The feature is
controlled by the query option minmax_filter_sorted_columns, which
defaults to true.

When minmax_filter_sorted_columns is true and the threshold (query
option minmax_filter_threshold) is 0, the patch automatically assigns
a reasonable value to the threshold and selects PAGE as the
filtering level (query option minmax_filtering_level). When the
threshold is greater than 0, neither the threshold nor the filtering
level is adjusted. When min and max column stats exist for the
leading sort column, these stats can be used to help select the
filters that are most likely to be helpful.

When minmax_filter_sorted_columns is set to false, no min/max filters
will be specifically assigned to the leading sort by columns.

In the backend, the pages to skip are found quickly by a fast code
path that, given a range in the filter, locates the lower and the
upper bounds in the sorted min and max value arrays. The skipped
pages are expressed as page ranges, which are later translated into
row ranges.
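
The idea behind the fast code path can be sketched as follows. This
is a minimal illustration, not the code in the patch: the names
PageRange, RowRange, FindSkippedPages and ToRowRanges are
hypothetical, values are assumed to be int64_t, and both the per-page
min array and the per-page max array are assumed to be sorted in
ascending order. Under those assumptions, the pages that cannot
overlap the filter range form a prefix and a suffix of the page list,
and both boundaries can be found with binary search.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration only; not Impala's classes.
struct PageRange { int first; int last; };          // inclusive page indexes
struct RowRange { int64_t first; int64_t last; };   // inclusive row indexes

// Returns the ranges of pages that cannot contain a value in
// [filter_min, filter_max]. Assumes page_min and page_max have the same
// length and are both sorted in ascending order.
std::vector<PageRange> FindSkippedPages(const std::vector<int64_t>& page_min,
    const std::vector<int64_t>& page_max, int64_t filter_min,
    int64_t filter_max) {
  const int n = static_cast<int>(page_min.size());
  std::vector<PageRange> skipped;
  if (n == 0) return skipped;
  // First page whose max value is >= filter_min; earlier pages lie
  // entirely below the filter range.
  const int first_kept = static_cast<int>(
      std::lower_bound(page_max.begin(), page_max.end(), filter_min)
      - page_max.begin());
  // One past the last page whose min value is <= filter_max; later pages
  // lie entirely above the filter range.
  const int end_kept = static_cast<int>(
      std::upper_bound(page_min.begin(), page_min.end(), filter_max)
      - page_min.begin());
  if (first_kept >= end_kept) {
    // No page overlaps the filter range, so every page is skipped.
    skipped.push_back({0, n - 1});
    return skipped;
  }
  if (first_kept > 0) skipped.push_back({0, first_kept - 1});
  if (end_kept < n) skipped.push_back({end_kept, n - 1});
  return skipped;
}

// Translates page ranges into row ranges. first_row[i] is the index of
// the first row in page i; num_rows is the total row count.
std::vector<RowRange> ToRowRanges(const std::vector<PageRange>& pages,
    const std::vector<int64_t>& first_row, int64_t num_rows) {
  const int n = static_cast<int>(first_row.size());
  std::vector<RowRange> rows;
  for (const PageRange& p : pages) {
    // The range ends just before the first row of the next page, or at
    // the last row when the range includes the final page.
    const int64_t end = (p.last + 1 < n) ? first_row[p.last + 1] : num_rows;
    rows.push_back({first_row[p.first], end - 1});
  }
  return rows;
}
```

For example, with three pages covering [0, 9], [10, 19] and [20, 29]
and a filter range of [12, 15], the sketch skips pages 0 and 2 and
keeps page 1. Each lookup costs O(log n) in the number of pages,
instead of one comparison per page.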

A new query option, minmax_filter_fast_code_path, controls whether
the fast code path is used. It takes one of three values: ON
(default), OFF, or VERIFICATION. The last value helps verify that the
results from the fast and the regular code paths are identical.
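
One way to picture the three modes is the dispatch sketched below.
This is an assumption-laden illustration, not the patch's code: the
names FastCodePathMode, PageList, PathFn and ComputeSkippedPages are
hypothetical, the two code paths are passed in as callables, and
throwing on a mismatch is only one possible reaction in VERIFICATION
mode. The commit message does not say how a mismatch is reported.

```cpp
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical mirror of the three option values named above.
enum class FastCodePathMode { OFF, ON, VERIFICATION };

using PageList = std::vector<int>;            // indexes of skipped pages
using PathFn = std::function<PageList()>;     // one code path, as a callable

// Runs the code path(s) selected by 'mode' and returns the skipped pages.
PageList ComputeSkippedPages(FastCodePathMode mode, const PathFn& regular,
    const PathFn& fast) {
  switch (mode) {
    case FastCodePathMode::OFF:
      return regular();
    case FastCodePathMode::ON:
      return fast();
    case FastCodePathMode::VERIFICATION: {
      // Run both paths and require identical results.
      const PageList expected = regular();
      const PageList actual = fast();
      if (expected != actual) {
        // Assumption: a mismatch is surfaced as an exception here.
        throw std::logic_error("fast and regular code paths disagree");
      }
      return actual;
    }
  }
  return regular();  // Unreachable for valid enum values.
}
```

In this sketch VERIFICATION pays for both code paths, so it suits
testing rather than production queries.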

Testing:
  1). Added two new tests in overlap_min_max_filters.test to verify
      a) Min/max filters are only created for leading sort by column;
      b) Query option minmax_filter_sorted_columns works.
  2). Added new tests in parquet-page-index-test.cc to cover the fast
      code path;
  3). Core [TBD]
  4). Performance [TBD]

Change-Id: I28c19c4b39b01ffa7d275fb245be85c28e9b2963
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-page-index-test.cc
M be/src/runtime/raw-value.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
18 files changed, 783 insertions(+), 23 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/17478/24
--
To view, visit http://gerrit.cloudera.org:8080/17478
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I28c19c4b39b01ffa7d275fb245be85c28e9b2963
Gerrit-Change-Number: 17478
Gerrit-PatchSet: 24
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
