Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16720 )
Change subject: IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate ...................................................................... Patch Set 29: (10 comments) I looked at the tests and the frontend part. I didn't look in detail at the core of the backend implementation yet. http://gerrit.cloudera.org:8080/#/c/16720/29/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/16720/29/be/src/exec/parquet/hdfs-parquet-scanner.cc@117 PS29, Line 117: NumMinMaxFilteredRowGroups I wonder if it would make more sense to make this NumRuntimeFilteredRowGroups. If we extended this in future to evaluate bloom filters or different kind of filters, say, I don't think we'd necessarily want a separate counter. http://gerrit.cloudera.org:8080/#/c/16720/29/be/src/runtime/decimal-value.h File be/src/runtime/decimal-value.h: http://gerrit.cloudera.org:8080/#/c/16720/29/be/src/runtime/decimal-value.h@153 PS29, Line 153: /// Implementations of add and subtract operators for Decimal4Value, Decimal8Value Are these test-only functions? I don't like including these in the DecimalValue class because I think they could encourage antipatterns - generally in query execution we don't want to mix decimals and doubles and this would make it kinda easy. I think it would probably be better if callers did the conversion from double explicitly, or if the helper functions were implemented in the tests. http://gerrit.cloudera.org:8080/#/c/16720/29/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: http://gerrit.cloudera.org:8080/#/c/16720/29/be/src/runtime/timestamp-value.h@284 PS29, Line 284: add nit: Add, we only use lower case for simple accessor methods - https://google.github.io/styleguide/cppguide.html#Function_Names http://gerrit.cloudera.org:8080/#/c/16720/29/be/src/runtime/timestamp-value.h@288 PS29, Line 288: subtract nit: Subtract http://gerrit.cloudera.org:8080/#/c/16720/29/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/16720/29/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@679 PS29, Line 679: Boolean should be returning scalar bool type afaict, not the object version. http://gerrit.cloudera.org:8080/#/c/16720/29/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@714 PS29, Line 714: // value, DecimalValue, does not carry precision and scale with it, which makes This isn't really true, there's always an associated ColumnType descriptor that has that metadata. That said, I don't think we want to deal with the mismatch in the backend, so the frontend should either insert casts to get the types to match or not generate the filter. http://gerrit.cloudera.org:8080/#/c/16720/29/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@741 PS29, Line 741: SlotRef slotRefInJoin = filter.getSrcExpr().unwrapSlotRef(true); I'm a little concerned about this part of checking type incompatibility - this strips implicit casts from the expression that's used to generate the runtime filter. But I think the runtime filter received will have values generated from expressions including the cast, which may have a different type from the original slot. I think we probably need to test some cases where the join is on different types and the planner implicitly casts one or both sides of the equality predicate. http://gerrit.cloudera.org:8080/#/c/16720/29/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java: http://gerrit.cloudera.org:8080/#/c/16720/29/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@790 PS29, Line 790: // TODO: Apply overlap predicates to filter out partitions. Hmm, the partition filtering via the original min-max predicates should probably just work aside from it being disabled here. http://gerrit.cloudera.org:8080/#/c/16720/29/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test File testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test: http://gerrit.cloudera.org:8080/#/c/16720/29/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@429 PS29, Line 429: ################################################### Can we also added a test where we use a query option to disable the min-max filters and make sure that it has the intended effect. http://gerrit.cloudera.org:8080/#/c/16720/29/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@496 PS29, Line 496: explain select count(*) from This explain tests should be implemented as PlannerTests, e.g. in runtime-filter-propagation.test or in a new planner test file. -- To view, visit http://gerrit.cloudera.org:8080/16720 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691 Gerrit-Change-Number: 16720 Gerrit-PatchSet: 29 Gerrit-Owner: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Tue, 08 Dec 2020 19:43:21 +0000 Gerrit-HasComments: Yes