Qifan Chen has uploaded a new patch set (#34). ( http://gerrit.cloudera.org:8080/16720 )
Change subject: IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate
......................................................................

IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

This patch adds a new class of predicates called overlap predicates to
aid in determining whether a Parquet row group or page overlaps with a
range computed from an equi hash join. If it does not, the entire row
group or page is skipped.

An overlap predicate exists as a min/max filter. For the following
query, the min and max in such a min/max filter are computed from the
values of the join column from table 'b' and become fully available
when the entire hash table is built. To evaluate the overlap predicate,
these two values are compared against the min/max of each row group or
page at the scan node for 'a'.

  select straight_join count(*) from
  lineitem_sorted_l_shipdate a join [SHUFFLE]
  lineitem_sorted_l_shipdate b
  where a.l_shipdate = b.l_receiptdate
  and b.l_commitdate = "1992-01-31";

An overlap predicate associated with the column type J (in hash table)
and scan column type S will be formed when one of the following is
true:

  Both J and S are booleans
  Both J and S are integers (tinyint, smallint, int, or bigint)
  Both J and S are approximate numeric (float or double)
  Both J and S are Decimals with the same precision and scale
  Both J and S are strings (STRING, CHAR or VARCHAR)
  Both J and S are dates
  Both J and S are timestamps

As with existing min/max filters, the MAX_NUM_RUNTIME_FILTERS query
option does not apply to min/max filters created for overlap
predicates. The overlap predicates are always evaluated, after the
min/max conjuncts (if any).

Two new run-time profile counters are added to report the number of
row groups and pages, respectively, filtered out via the overlap
predicates:

  1. NumRuntimeFilteredRowGroups
  2. NumRuntimeFilteredPages

Testing:
  1. Unit tested on various column types with TPCH and TPCDS tables.
     Benefits were significant when the join column on the outer table
     is sorted, or when the min/max boundary values of the pages or
     row groups are monotonic;
  2. Added new tests in min_max_filters.test for join column type
     compatibility and to demonstrate the number of filtered out pages
     and row groups with the two new profile counters;
  3. Added data type specific overlap method tests in
     min-max-filter-test.cc;
  4. Core testing.

TBD in this patch:
  1. Performance measurement.

To do in follow-up JIRAs:
  1. Apply the overlap predicate on partition columns;
  2. IR code-gen for various MinMaxFilter::EvalOverlap methods.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/date-value.cc
M be/src/runtime/date-value.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
27 files changed, 1,318 insertions(+), 160 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/16720/34
--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 34
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
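
For readers skimming this notification, the overlap test described in the commit message can be illustrated with a minimal sketch. This is not code from the patch: the names Range, Overlaps and CanSkip are hypothetical, the values are plain 64-bit integers, and the real MinMaxFilter::EvalOverlap methods may be shaped quite differently. It only shows the idea that a row group or page can be skipped when its min/max statistics fall entirely outside the min/max range collected from the join build side.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical closed interval [min, max]; not a type from the patch.
struct Range {
  int64_t min;
  int64_t max;
};

// Returns true if the two closed intervals share at least one value.
// 'filter' stands for the min/max gathered from the hash join build side,
// 'stats' for the min/max statistics of one row group or page.
bool Overlaps(const Range& filter, const Range& stats) {
  assert(filter.min <= filter.max);
  assert(stats.min <= stats.max);
  // Disjoint only if one interval ends before the other begins.
  return !(stats.max < filter.min || filter.max < stats.min);
}

// A row group or page can be skipped when no overlap exists, since no row
// in it can satisfy the equi-join predicate.
bool CanSkip(const Range& filter, const Range& stats) {
  return !Overlaps(filter, stats);
}
```

Under these assumptions, with a filter range of [10, 20], a page whose statistics are [21, 30] can be skipped, while a page with [20, 25] must still be read because the value 20 is shared.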