Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17295 )

Change subject: IMPALA-10650: Bailout min/max filters in hash join builder early
......................................................................


Patch Set 27:

(14 comments)

Answer Riza and Zoltan's review comments.

http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/filter-context.h
File be/src/exec/filter-context.h:

http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/filter-context.h@155
PS27, Line 155:   static bool ShouldRejectFilterBasedOnColumnStats(
> nit: I'm OK with reformatting, but I'm not sure if it was intended
Done


http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/filter-context.cc
File be/src/exec/filter-context.cc:

http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/filter-context.cc@231
PS27, Line 231: example
> Could you please update this example?
Done


http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/hdfs-parquet-scanner.cc@984
PS27, Line 984: scalar_reader->offset_index_
> nit: simply 'offset_index' from at L978?
Good catch!

Done


http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/hdfs-parquet-scanner.cc@997
PS27, Line 997: DCHECK
> nit: DCHECK_GE could be used. It has the advantage that in case of failure
Done


http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/hdfs-parquet-scanner.cc@1010
PS27, Line 1010: Expected
> Any reason why we don't want the error message to be logged here?
Done


http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/hdfs-parquet-scanner.cc@1107
PS27, Line 1107:
> nit: probably unintended space
Done


http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/parquet-column-stats.cc
File be/src/exec/parquet/parquet-column-stats.cc:

http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/parquet-column-stats.cc@281
PS27, Line 281:   const int remainder = num_values % batch;
> nit: do we need to calculate remainder? In the second for-loop we could hav
Done


http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/parquet-column-stats.cc@321
PS27, Line 321:   const int remainder = num_values % batch;
> nit: same as above
Done


http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/parquet-column-stats.inline.h
File be/src/exec/parquet/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/parquet-column-stats.inline.h@143
PS27, Line 143:   DCHECK(buffer.size() == sizeof(int32_t));
              :   DCHECK(parquet_type == parquet::Type::INT32);
> nit: DCHECK_EQ
Done


http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/parquet/parquet-column-stats.inline.h@159
PS27, Line 159:   DCHECK(buffer.size() == sizeof(int32_t));
              :   DCHECK(parquet_type == parquet::Type::INT32);
> nit: DCHECK_EQ
Done


http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/partitioned-hash-join-builder.cc@337
PS27, Line 337:   for (const FilterContext& ctx : filter_ctxs_) {
> I wonder if we can speed this up by iterating ONLY the minmax filters.
Sounds like a good idea.

A new vector minmax_filter_ctxs_ is added to cache the local min max filter 
contexts. An element from it is removed if the element is set to AlwaysTrue. 
The element will not be bothered with overlap check again.


http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/partitioned-hash-join-builder.cc@345
PS27, Line 345: not_useful = false;
> nit: I think it'd be a bit more readable if we decrease the negations, i.e.
Done


http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/exec/partitioned-hash-join-builder.cc@404
PS27, Line 404:     PublishRuntimeFilters(num_build_rows);
> It seems to me that PublishRuntimeFilters is only called here in FinalizeBu
Partial filter publishing with propagation to scan nodes may be a little bit 
complicated since it involves network traffic and context management. See 
PhjBuilder::FinalizeBuild().

With the work optimizing the insertion to an already disabled filter, and the 
work to only iterate over enabled filters for overlap checking, it looks like 
we can live with the current publishing strategy.


http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/util/min-max-filter-ir.cc
File be/src/util/min-max-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17295/27/be/src/util/min-max-filter-ir.cc@114
PS27, Line 114: predicion
> nit: prediction
Done



--
To view, visit http://gerrit.cloudera.org:8080/17295
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I193646e7acfdd3023f7c947d8107da58a1f41183
Gerrit-Change-Number: 17295
Gerrit-PatchSet: 27
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Riza Suminto <riza.sumi...@cloudera.com>
Gerrit-Reviewer: Wenzhe Zhou <wz...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Tue, 18 May 2021 17:48:34 +0000
Gerrit-HasComments: Yes

Reply via email to