[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

Tim Armstrong (Code Review) Tue, 08 Dec 2020 11:43:34 -0800

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )


Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
......................................................................


Patch Set 29:

(10 comments)

I looked at the tests and the frontend part. I didn't look in detail at the 
core of the backend implementation yet.

http://gerrit.cloudera.org:8080/#/c/16720/29/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16720/29/be/src/exec/parquet/hdfs-parquet-scanner.cc@117
PS29, Line 117: NumMinMaxFilteredRowGroups
I wonder if it would make more sense to make this NumRuntimeFilteredRowGroups.

If we extended this in future to evaluate bloom filters or different kind of 
filters, say, I don't think we'd necessarily want a separate counter.


http://gerrit.cloudera.org:8080/#/c/16720/29/be/src/runtime/decimal-value.h
File be/src/runtime/decimal-value.h:

http://gerrit.cloudera.org:8080/#/c/16720/29/be/src/runtime/decimal-value.h@153
PS29, Line 153:   /// Implementations of add and subtract operators for 
Decimal4Value, Decimal8Value
Are these test-only functions? I don't like including these in the DecimalValue 
class because I think they could encourage antipatterns - generally in query 
execution we don't want to mix decimals and doubles and this would make it 
kinda easy.

I think it would probably be better if callers did the conversion from double 
explicitly, or if the helper functions were implemented in the tests.


http://gerrit.cloudera.org:8080/#/c/16720/29/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

http://gerrit.cloudera.org:8080/#/c/16720/29/be/src/runtime/timestamp-value.h@284
PS29, Line 284: add
nit: Add, we only use lower case for simple accessor methods - 
https://google.github.io/styleguide/cppguide.html#Function_Names


http://gerrit.cloudera.org:8080/#/c/16720/29/be/src/runtime/timestamp-value.h@288
PS29, Line 288: subtract
nit: Subtract


http://gerrit.cloudera.org:8080/#/c/16720/29/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/16720/29/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@679
PS29, Line 679: Boolean
should be returning scalar bool type afaict, not the object version.


http://gerrit.cloudera.org:8080/#/c/16720/29/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@714
PS29, Line 714:     // value, DecimalValue, does not carry precision and scale 
with it, which makes
This isn't really true, there's always an associated ColumnType descriptor that 
has that metadata. That said, I don't think we want to deal with the mismatch 
in the backend, so the frontend should either insert casts to get the types to 
match or not generate the filter.


http://gerrit.cloudera.org:8080/#/c/16720/29/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@741
PS29, Line 741:     SlotRef slotRefInJoin = 
filter.getSrcExpr().unwrapSlotRef(true);
I'm a little concerned about this part of checking type incompatibility - this 
strips implicit casts from the expression that's used to generate the runtime 
filter. But I think the runtime filter received will have values generated from 
expressions including the cast, which may have a different type from the 
original slot.

I think we probably need to test some cases where the join is on different 
types and the planner implicitly casts one or both sides of the equality 
predicate.


http://gerrit.cloudera.org:8080/#/c/16720/29/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/16720/29/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@790
PS29, Line 790:             // TODO: Apply overlap predicates to filter out 
partitions.
Hmm, the partition filtering via the original min-max predicates should 
probably just work aside from it being disabled here.


http://gerrit.cloudera.org:8080/#/c/16720/29/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/16720/29/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@429
PS29, Line 429: ###################################################
Can we also added a test where we use a query option to disable the min-max 
filters and make sure that it has the intended effect.


http://gerrit.cloudera.org:8080/#/c/16720/29/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@496
PS29, Line 496: explain select count(*) from
This explain tests should be implemented as PlannerTests, e.g. in 
runtime-filter-propagation.test or in a new planner test file.



--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 29
Gerrit-Owner: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Tue, 08 Dec 2020 19:43:21 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

Reply via email to