Qifan Chen has uploaded a new patch set (#28). ( http://gerrit.cloudera.org:8080/17295 )
Change subject: IMPALA-10650: Bailout min/max filters in hash join builder early
......................................................................

IMPALA-10650: Bailout min/max filters in hash join builder early

This change set addresses the weakness in populating min/max filters
in the hash join builder by periodically measuring the usefulness of
each filter and setting the 'always_true_' flag accordingly. Once the
flag is set to true, insertion into such a filter skips every step
from evaluating the value from a row to checking the value against
the min/max range. This optimization is LLVM-enabled.

In addition, a new flag 'is_min_max_value_present' is added to
TRuntimeFilterTargetDesc to indicate whether the min/max column stats
are present in the query plan. The flag eliminates the need to check
for the presence of min/max stats for every row in the back-end.

Early bailout improves the HJ builder step in general. For example,
the step for join node #11 in TPCDS Q8 improves by 13%, and the step
for join node #8 in TPCDS Q16 improves by 3.2%.

The Insert() methods are optimized with branch prediction compiler
hints, which yield the following improvements when tested with the
insertion of 10000 randomly generated items.

  Small Integers: 7.0%
  Integers: 4.1%
  Big Integers: 4.3%
  Strings: 5.6%
  Dates: 4.4%
  Timestamps: 10.7%
  Decimals(4): 10.4%
  Decimals(8): 9.1%

In addition, the min/max stats for pages are read in batches, with a
fast track version for the column types int32_t, int64_t, float,
double and date, whose in-memory format is identical to their Parquet
storage format. For a row group, the page locations are read only
once, instead of once for every page skipped, resulting in a 100x
speedup when a subset of 199 pages is skipped.

Testing:
1. Ran core test;
2. Ran performance test (TBD).
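
For readers who have not opened the patch, the early bailout described above can be sketched roughly as follows. This is an illustration only, not the code in the change: the class SketchMinMaxFilter, the function BuildWithEarlyBailout, the threshold and check_interval parameters, and the usefulness criterion (fraction of the column's min/max range covered by the filter) are all assumptions made for this sketch. The commit message does not say how usefulness is measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Portable stand-in for a branch prediction hint (GCC/Clang builtin).
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define SKETCH_UNLIKELY(x) (x)
#endif

// Hypothetical, simplified min/max filter over int64 keys.
class SketchMinMaxFilter {
 public:
  // col_min/col_max: column stats assumed to be supplied by the planner.
  // threshold: assumed cutoff; a filter covering more than this fraction
  // of the column range is treated as useless.
  SketchMinMaxFilter(int64_t col_min, int64_t col_max, double threshold)
    : col_min_(col_min), col_max_(col_max), threshold_(threshold) {
    assert(col_min <= col_max);
  }

  // Hot path: once the filter is always true, skip all work.
  void Insert(int64_t v) {
    if (SKETCH_UNLIKELY(always_true_)) return;
    min_ = std::min(min_, v);
    max_ = std::max(max_, v);
  }

  // Periodic check: give up on the filter if it can no longer prune much.
  void CheckUsefulness() {
    if (always_true_ || min_ > max_) return;  // already bailed out, or empty
    double covered = static_cast<double>(max_) - static_cast<double>(min_);
    double total = static_cast<double>(col_max_) - static_cast<double>(col_min_);
    if (total <= 0.0 || covered / total > threshold_) always_true_ = true;
  }

  bool AlwaysTrue() const { return always_true_; }
  int64_t min() const { return min_; }
  int64_t max() const { return max_; }

 private:
  int64_t col_min_;
  int64_t col_max_;
  double threshold_;
  bool always_true_ = false;
  int64_t min_ = std::numeric_limits<int64_t>::max();
  int64_t max_ = std::numeric_limits<int64_t>::lowest();
};

// Feeds build-side keys to the filter and re-evaluates it every
// check_interval rows. Returns how many rows took the full insert path.
inline int64_t BuildWithEarlyBailout(const std::vector<int64_t>& keys,
    int64_t check_interval, SketchMinMaxFilter* filter) {
  assert(check_interval > 0);
  int64_t seen = 0;
  int64_t inserted = 0;
  for (int64_t key : keys) {
    ++seen;
    // Other build work (e.g. hash table insertion) would continue here.
    if (filter->AlwaysTrue()) continue;
    filter->Insert(key);
    ++inserted;
    if (seen % check_interval == 0) filter->CheckUsefulness();
  }
  return inserted;
}
```

In this sketch the cost of a useless filter is bounded by the check interval: once the range grows past the threshold, every later row pays only for one predictable branch.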

Change-Id: I193646e7acfdd3023f7c947d8107da58a1f41183
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/filter-context.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/parquet/parquet-column-stats.inline.h
M be/src/exec/parquet/parquet-common.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-builder.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/util/min-max-filter-ir.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/util/TColumnValueUtil.java
17 files changed, 977 insertions(+), 297 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/17295/28
--
To view, visit http://gerrit.cloudera.org:8080/17295
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I193646e7acfdd3023f7c947d8107da58a1f41183
Gerrit-Change-Number: 17295
Gerrit-PatchSet: 28
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>