Qifan Chen has uploaded a new patch set (#33). ( http://gerrit.cloudera.org:8080/17478 )
Change subject: IMPALA-10709: Min/max filters should be enabled for joins on sorted columns in Parquet tables
......................................................................

IMPALA-10709: Min/max filters should be enabled for joins on sorted columns in Parquet tables

This patch enables, by default, min/max filters for equi-joins on
lexical sort-by columns in a Parquet table created by Impala. This
takes advantage of Impala sorting the min/max values in the column
index of each data file for the table. The control knob is the query
option minmax_filter_sorted_columns, which defaults to true.

When minmax_filter_sorted_columns is true, the patch generates
min/max filters only for the leading sort columns. The normal control
knobs minmax_filter_threshold (for the threshold) and
minmax_filtering_level (for the filtering level) still work. When the
threshold is 0, the patch automatically assigns a reasonable value to
the threshold and selects PAGE as the filtering level.

In the backend, the skipped pages are found quickly by taking a fast
code path that, given a range in the filter, identifies the
corresponding lower and upper bounds in the sorted min and max value
arrays. The skipped pages are expressed as page ranges, which are
translated into row ranges later on. A new query option,
minmax_filter_fast_code_path, is added to control the fast code path.
It takes one of three values: ON (default), OFF, or VERIFICATION. The
last helps verify that the results from the fast and the regular code
paths are the same.

Preliminary performance testing (joining into a simplified TPCH
lineitem table with 2 sorted BIGINT columns and a total of 6001215
rows) confirms that min/max filtering on leading sort-by columns
greatly improves the performance of scan operators. The best result
is seen with pages containing no more than 24000 rows: 84.62ms (page
level filtering) vs. 115.27ms (row group level filtering) vs.
137.14ms (no filtering). The query used is as follows.

  select straight_join a.l_orderkey
  from simpflified_lineitem a
  join [SHUFFLE] tpch_parquet.lineitem b
  where a.l_orderkey = b.l_orderkey
    and b.l_receiptdate = "1998-12-31"

The patch also fixes the abnormal min/max display in the "Final filter
table" section of a profile for the DECIMAL, TIMESTAMP and DATE data
types, and the reading of the DATE column index in batch without
validation.

Testing:
 1). Added a new test overlap_min_max_filters_on_sorted_columns.test
     to verify that
     a) Min/max filters are only created for the leading sort-by
        column;
     b) Query option minmax_filter_sorted_columns works;
     c) Query option minmax_filter_fast_code_path works.
 2). Added new tests in parquet-page-index-test.cc to test the fast
     code path under various conditions;
 3). Ran core tests successfully.

Change-Id: I28c19c4b39b01ffa7d275fb245be85c28e9b2963
---
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-page-index-test.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/raw-value.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/debug-util.cc
M be/src/util/debug-util.h
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
A testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters_on_sorted_columns.test
M tests/query_test/test_runtime_filters.py
22 files changed, 919 insertions(+), 41 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/17478/33
--
To view, visit http://gerrit.cloudera.org:8080/17478
To unsubscribe,
visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I28c19c4b39b01ffa7d275fb245be85c28e9b2963
Gerrit-Change-Number: 17478
Gerrit-PatchSet: 33
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Aman Sinha <amsi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
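
Illustrative note on the fast code path described in the commit message: the idea of locating skipped pages through lower and upper bounds in sorted min and max arrays can be sketched with two binary searches. This is only a minimal Python sketch under stated assumptions, not the code in the patch (the real implementation is in the C++ backend). All function and parameter names below are hypothetical, page min/max values are assumed to be sorted ascending, and the filter range is treated as inclusive. The linear function stands in for a "regular" path, so comparing the two mirrors what the VERIFICATION setting is described as doing.

```python
from bisect import bisect_left, bisect_right


def skipped_page_ranges_fast(page_mins, page_maxs, filter_min, filter_max):
    """Return half-open page ranges [begin, end) that cannot match the filter.

    Assumes page_mins and page_maxs are both sorted ascending, which is the
    situation for a leading sort-by column (hypothetical helper).
    """
    n = len(page_mins)
    # First page whose max is >= filter_min: earlier pages lie below the range.
    first = bisect_left(page_maxs, filter_min)
    # One past the last page whose min is <= filter_max: later pages lie above.
    end = bisect_right(page_mins, filter_max)
    if first >= end:
        # No page overlaps the filter range, so every page is skipped.
        return [(0, n)] if n else []
    ranges = []
    if first > 0:
        ranges.append((0, first))
    if end < n:
        ranges.append((end, n))
    return ranges


def skipped_pages_linear(page_mins, page_maxs, filter_min, filter_max):
    """Reference check: test every page for overlap with the filter range."""
    return [i for i, (lo, hi) in enumerate(zip(page_mins, page_maxs))
            if hi < filter_min or lo > filter_max]


def to_row_ranges(page_ranges, first_row_of_page, num_rows):
    """Translate half-open page ranges into half-open row ranges."""
    bounds = list(first_row_of_page) + [num_rows]
    return [(bounds[begin], bounds[end]) for begin, end in page_ranges]


if __name__ == "__main__":
    mins, maxs = [1, 10, 20, 30], [9, 19, 29, 39]
    pages = skipped_page_ranges_fast(mins, maxs, 12, 25)
    print(pages)
    print(to_row_ranges(pages, [0, 100, 200, 300], 400))
```

The two binary searches cost O(log n) per filter range instead of one comparison per page, which is presumably why sortedness of the column index matters for this path.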