Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17295 )
Change subject: IMPALA-10650: Bailout min/max filters in hash join builder early ...................................................................... Patch Set 27: (14 comments) Answer Riza and Zoltan's review comments. http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/filter-context.h File be/src/exec/filter-context.h: http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/filter-context.h@155 PS27, Line 155: static bool ShouldRejectFilterBasedOnColumnStats( > nit: I'm OK with reformatting, but I'm not sure if it was intended Done http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/filter-context.cc File be/src/exec/filter-context.cc: http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/filter-context.cc@231 PS27, Line 231: example > Could you please update this example? Done http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/hdfs-parquet-scanner.cc@984 PS27, Line 984: scalar_reader->offset_index_ > nit: simply 'offset_index' from at L978? Good catch! Done http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/hdfs-parquet-scanner.cc@997 PS27, Line 997: DCHECK > nit: DCHECK_GE could be used. It has the advantage that in case of failure Done http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/hdfs-parquet-scanner.cc@1010 PS27, Line 1010: Expected > Any reason why we don't want the error message to be logged here? Done http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/hdfs-parquet-scanner.cc@1107 PS27, Line 1107: > nit: probably unintended space Done http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/parquet-column-stats.cc File be/src/exec/parquet/parquet-column-stats.cc: http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/parquet-column-stats.cc@281 PS27, Line 281: const int remainder = num_values % batch; > nit: do we need to calculate remainder? In the second for-loop we could hav Done http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/parquet-column-stats.cc@321 PS27, Line 321: const int remainder = num_values % batch; > nit: same as above Done http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/parquet-column-stats.inline.h File be/src/exec/parquet/parquet-column-stats.inline.h: http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/parquet-column-stats.inline.h@143 PS27, Line 143: DCHECK(buffer.size() == sizeof(int32_t)); : DCHECK(parquet_type == parquet::Type::INT32); > nit: DCHECK_EQ Done http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/parquet-column-stats.inline.h@159 PS27, Line 159: DCHECK(buffer.size() == sizeof(int32_t)); : DCHECK(parquet_type == parquet::Type::INT32); > nit: DCHECK_EQ Done http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/partitioned-hash-join-builder.cc File be/src/exec/partitioned-hash-join-builder.cc: http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/partitioned-hash-join-builder.cc@337 PS27, Line 337: for (const FilterContext& ctx : filter_ctxs_) { > I wonder if we can speed this up by iterating ONLY the minmax filters. Sounds like a good idea. A new vector minmax_filter_ctxs_ is added to cache the local min max filter contexts. An element from it is removed if the element is set to AlwaysTrue. The element will not be bothered with overlap check again. http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/partitioned-hash-join-builder.cc@345 PS27, Line 345: not_useful = false; > nit: I think it'd be a bit more readable if we decrease the negations, i.e. Done http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/partitioned-hash-join-builder.cc@404 PS27, Line 404: PublishRuntimeFilters(num_build_rows); > It seems to me that PublishRuntimeFilters is only called here in FinalizeBu Partial filter publishing with propagation to scan nodes may be a little bit complicated since it involves network traffic and context management. See PhjBuilder::FinalizeBuild(). With the work optimizing the insertion to an already disabled filter, and the work to only iterate over enabled filters for overlap checking, it looks like we can live with the current publishing strategy. http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/util/min-max-filter-ir.cc File be/src/util/min-max-filter-ir.cc: http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/util/min-max-filter-ir.cc@114 PS27, Line 114: predicion > nit: prediction Done -- To view, visit http://gerrit.cloudera.org:8080/17295 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I193646e7acfdd3023f7c947d8107da58a1f41183 Gerrit-Change-Number: 17295 Gerrit-PatchSet: 27 Gerrit-Owner: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com> Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com> Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Tue, 18 May 2021 17:48:34 +0000 Gerrit-HasComments: Yes