Qifan Chen has uploaded a new patch set (#29). ( 
http://gerrit.cloudera.org:8080/17478 )

Change subject: IMPALA-10709: Min/max filters should be enabled for joins on 
sorted columns in Parquet tables
......................................................................

IMPALA-10709: Min/max filters should be enabled for joins on sorted columns in 
Parquet tables

This patch enables min/max filters for equi-joins on sort-by
columns in a Parquet table created by Impala. This takes advantage
of Impala sorting the min/max values in the column index of each data
file for the table. When the table has multiple sort-by columns, only
the leading column is assigned a min/max filter. The control knob is
the query option minmax_filter_sorted_columns, which defaults to
true.

When minmax_filter_sorted_columns is true and the threshold (query
option minmax_filter_threshold) is 0, the patch automatically assigns
a reasonable value to the threshold and selects PAGE as the
filtering level (query option minmax_filtering_level). When the
threshold is greater than 0, no adjustment is made to either the
threshold or the filtering level. When the min and max column stats
exist on the leading sort-by column, these stats can be used to help
select the filters that are most likely to be helpful.

When minmax_filter_sorted_columns is set to false, no min/max filters
are specifically assigned to the leading sort-by column.

In the backend, given a range in the filter, the skipped pages can be
found quickly by taking a fast code path that locates the
corresponding lower and upper bounds in the sorted min and max value
arrays. The skipped pages are expressed as page ranges, which are
translated into row ranges later on.
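
The bound lookup described above can be sketched as follows. This is
an illustrative Python sketch, not the C++ code in the patch: the
function names, the inclusive range representation and the
first_row_of_page array are assumptions made for the example. It
assumes the per-page min and max arrays are both sorted in ascending
order.

```python
from bisect import bisect_left, bisect_right


def skipped_page_ranges(page_mins, page_maxs, filter_min, filter_max):
    """Return inclusive (first_page, last_page) ranges that can be skipped.

    Assumes page_mins and page_maxs are sorted ascending (hypothetical
    stand-ins for the column index arrays of a sorted column).
    """
    n = len(page_mins)
    # First page whose max is >= filter_min; earlier pages lie below the filter.
    first = bisect_left(page_maxs, filter_min)
    # Last page whose min is <= filter_max; later pages lie above the filter.
    last = bisect_right(page_mins, filter_max) - 1
    if first > last:
        # No page overlaps the filter range, so every page is skipped.
        return [(0, n - 1)] if n else []
    ranges = []
    if first > 0:
        ranges.append((0, first - 1))
    if last < n - 1:
        ranges.append((last + 1, n - 1))
    return ranges


def page_ranges_to_row_ranges(page_ranges, first_row_of_page, num_rows):
    """Translate inclusive page ranges into inclusive row ranges."""
    row_ranges = []
    for start, end in page_ranges:
        row_start = first_row_of_page[start]
        if end + 1 < len(first_row_of_page):
            row_end = first_row_of_page[end + 1] - 1
        else:
            row_end = num_rows - 1
        row_ranges.append((row_start, row_end))
    return row_ranges
```

For example, with page mins [1, 10, 20, 30], page maxes
[9, 19, 29, 39] and a filter range of [12, 25], the sketch skips
pages 0 and 3. With 100 rows per page these become row ranges
(0, 99) and (300, 399).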

A new query option minmax_filter_fast_code_path is added to control
the use of the fast code path. It takes one of three values: ON
(default), OFF, or VERIFICATION. The last helps verify that the
results from the fast and the regular code paths are the same.
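
One way to picture the three modes is the Python sketch below. It is
only an illustration under stated assumptions: the enum, the function
names and the linear scan standing in for the regular code path are
hypothetical and do not mirror the C++ code in the patch.

```python
from bisect import bisect_left, bisect_right
from enum import Enum


class FastCodePath(Enum):
    """Hypothetical mirror of the minmax_filter_fast_code_path values."""
    OFF = 0
    ON = 1
    VERIFICATION = 2


def kept_pages_regular(page_mins, page_maxs, filter_min, filter_max):
    # Stand-in for the regular path: check every page for overlap.
    return [i for i in range(len(page_mins))
            if page_maxs[i] >= filter_min and page_mins[i] <= filter_max]


def kept_pages_fast(page_mins, page_maxs, filter_min, filter_max):
    # Fast path: binary search, valid only when both arrays are sorted.
    first = bisect_left(page_maxs, filter_min)
    last = bisect_right(page_mins, filter_max) - 1
    return list(range(first, last + 1))


def kept_pages(mode, page_mins, page_maxs, filter_min, filter_max):
    """Dispatch on the mode; VERIFICATION runs both paths and compares."""
    args = (page_mins, page_maxs, filter_min, filter_max)
    if mode is FastCodePath.OFF:
        return kept_pages_regular(*args)
    fast = kept_pages_fast(*args)
    if mode is FastCodePath.VERIFICATION:
        regular = kept_pages_regular(*args)
        if fast != regular:
            raise AssertionError(
                "fast path %r differs from regular path %r" % (fast, regular))
    return fast
```

In this sketch VERIFICATION costs a full scan of the page list on top
of the binary search, so it fits testing better than regular use.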

Preliminary performance testing (joining into a simplified TPCH
lineitem table with 2 sorted BIGINT columns and a total of 6001215
rows) confirms that min/max filtering on leading sort-by columns
improves the performance of scan operators. The best result is seen
with pages containing no more than 24000 rows: 84.62ms (page level
filtering) vs. 115.27ms (row group level filtering) vs. 137.14ms (no
filtering), i.e. about 38% less scan time than with no filtering.
The query used is as follows.

  select straight_join a.l_orderkey from
  simpflified_lineitem a join [SHUFFLE] tpch_parquet.lineitem b
  where a.l_orderkey = b.l_orderkey and b.l_receiptdate = "1998-12-31"

Also fixed in the patch are abnormal min/max displays in the "Final
filter table" section of a profile for the DECIMAL, TIMESTAMP and DATE
data types, and reading the DATE column index in batch without
validation.

Testing:
  1). Added new tests in overlap_min_max_filters.test to verify
      a) Min/max filters are only created for the leading sort-by
         column;
      b) Query option minmax_filter_sorted_columns works;
      c) Query option minmax_filter_fast_code_path works.
  2). Added new tests in parquet-page-index-test.cc to test the fast
      code path under various conditions;
  3). Core [TBD]

Change-Id: I28c19c4b39b01ffa7d275fb245be85c28e9b2963
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-page-index-test.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/raw-value.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
21 files changed, 972 insertions(+), 41 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/17478/29
--
To view, visit http://gerrit.cloudera.org:8080/17478
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I28c19c4b39b01ffa7d275fb245be85c28e9b2963
Gerrit-Change-Number: 17478
Gerrit-PatchSet: 29
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
