Qifan Chen has uploaded a new patch set (#25). ( http://gerrit.cloudera.org:8080/17478 )
Change subject: IMPALA-10709: Min/max filters should be enabled for joins on sorted columns in Parquet tables
......................................................................

IMPALA-10709: Min/max filters should be enabled for joins on sorted columns in Parquet tables

This patch enables min/max filters for equi-joins on sort by columns in a Parquet table created by Impala. This takes advantage of Impala sorting the min/max values in the column index of each data file for the table. When the table has multiple sort by columns, only the leading column is assigned a min/max filter.

The control knob is the query option minmax_filter_sorted_columns, which defaults to true. When minmax_filter_sorted_columns is true and the threshold (query option minmax_filter_threshold) is 0, the patch automatically assigns a reasonable value to the threshold and selects PAGE as the filtering level (query option minmax_filtering_level). When the threshold is greater than 0, neither the threshold nor the filtering level is adjusted. When min and max column stats exist on the leading sort column, these stats can be used to help select the filters that are most likely to be helpful.

When minmax_filter_sorted_columns is set to false, no min/max filters are specifically assigned to the leading sort by columns.

In the backend, the skipped pages can be found quickly by taking a fast code path that finds the lower and upper bounds in the sorted min and max value arrays, given a range in the filter. The skipped pages are expressed as page ranges, which are later translated into row ranges.

A new query option, minmax_filter_fast_code_path, is added to control the fast code path. It takes one of three values: ON (default), OFF, or VERIFICATION. The last value verifies that the fast and the regular code paths produce identical results.

Testing:
1). Added two new tests in overlap_min_max_filters.test to verify that
    a) Min/max filters are only created for the leading sort by column;
    b) Query option minmax_filter_sorted_columns works.
2). Added new tests in parquet-page-index-test.cc to cover the fast code path;
3). Core [TBD]
4). Performance [TBD]

Change-Id: I28c19c4b39b01ffa7d275fb245be85c28e9b2963
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-page-index-test.cc
M be/src/runtime/raw-value.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
18 files changed, 782 insertions(+), 23 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/17478/25
--
To view, visit http://gerrit.cloudera.org:8080/17478
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I28c19c4b39b01ffa7d275fb245be85c28e9b2963
Gerrit-Change-Number: 17478
Gerrit-PatchSet: 25
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
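
Illustrative note on the fast code path described in the commit message: the sketch below shows one way the lower and upper bounds could be located in sorted per-page min and max arrays, and how the resulting page ranges could be turned into row ranges. It is not the code in this patch. The names PageRange, RowRange, SkippedPagesFast, SkippedPagesSlow, FastMatchesSlow and ToRowRanges are hypothetical, values are assumed to be int64_t, and both arrays are assumed to be sorted in ascending order. The comparison helper only mimics the idea behind the VERIFICATION setting.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration only; not Impala's actual classes.
struct PageRange { int first; int last; };          // inclusive page indexes
struct RowRange { int64_t first; int64_t last; };   // inclusive row indexes

// Fast path sketch: page_min and page_max are assumed sorted ascending.
// Returns the ranges of pages that cannot hold any value in [lo, hi].
std::vector<PageRange> SkippedPagesFast(const std::vector<int64_t>& page_min,
    const std::vector<int64_t>& page_max, int64_t lo, int64_t hi) {
  const int n = static_cast<int>(page_min.size());
  // First page whose max >= lo; all earlier pages lie entirely below the filter.
  const int begin = static_cast<int>(
      std::lower_bound(page_max.begin(), page_max.end(), lo) - page_max.begin());
  // First page whose min > hi; it and all later pages lie entirely above the filter.
  const int end = static_cast<int>(
      std::upper_bound(page_min.begin(), page_min.end(), hi) - page_min.begin());
  std::vector<PageRange> skipped;
  if (begin >= end) {  // No page overlaps the filter: skip everything.
    if (n > 0) skipped.push_back({0, n - 1});
    return skipped;
  }
  if (begin > 0) skipped.push_back({0, begin - 1});
  if (end < n) skipped.push_back({end, n - 1});
  return skipped;
}

// Regular path sketch: test every page and merge adjacent skipped pages.
std::vector<PageRange> SkippedPagesSlow(const std::vector<int64_t>& page_min,
    const std::vector<int64_t>& page_max, int64_t lo, int64_t hi) {
  std::vector<PageRange> skipped;
  for (int i = 0; i < static_cast<int>(page_min.size()); ++i) {
    if (page_max[i] >= lo && page_min[i] <= hi) continue;  // page overlaps
    if (!skipped.empty() && skipped.back().last == i - 1) {
      skipped.back().last = i;
    } else {
      skipped.push_back({i, i});
    }
  }
  return skipped;
}

// Verification-style check: do both paths report the same page ranges?
bool FastMatchesSlow(const std::vector<int64_t>& page_min,
    const std::vector<int64_t>& page_max, int64_t lo, int64_t hi) {
  const auto a = SkippedPagesFast(page_min, page_max, lo, hi);
  const auto b = SkippedPagesSlow(page_min, page_max, lo, hi);
  if (a.size() != b.size()) return false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i].first != b[i].first || a[i].last != b[i].last) return false;
  }
  return true;
}

// Translates page ranges into row ranges. page_first_row[i] is the index of
// the first row in page i; num_rows is the total row count of the column chunk.
std::vector<RowRange> ToRowRanges(const std::vector<PageRange>& pages,
    const std::vector<int64_t>& page_first_row, int64_t num_rows) {
  const int n = static_cast<int>(page_first_row.size());
  std::vector<RowRange> rows;
  for (const PageRange& p : pages) {
    const int64_t next = (p.last + 1 < n) ? page_first_row[p.last + 1] : num_rows;
    rows.push_back({page_first_row[p.first], next - 1});
  }
  return rows;
}
```

For example, with page minimums {1, 10, 20, 30}, page maximums {9, 19, 29, 39} and a filter range of [12, 25], this sketch skips pages 0 and 3. With 100 rows per page, those pages correspond to row ranges [0, 99] and [300, 399].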