Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
......................................................................


Patch Set 8:

(8 comments)

Replied to and added some more.

Can you please also point out the explain output with in-list filters? Love to 
see them.

It is unfortunate that there are massive number of filter Ids changes due to 
the introduction of the in-list type. I think some day we should re-assign the 
Ids at the end of compilation so that they are consecutive.

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1221
PS4, Line 1221:
> Sorry that I'm not quite understand these.
 >
 > > I was originally thinking that when the target of a IN-list
 > filter is partition columns, then the target can be removed in FE.
 > > Doing the test here means such targets are retained in the plan
 > and do not contribute.
 >
 > Do you mean eliminating the partitions in FE? The IN-list filters
 > are generated in runtime based on the build side data of hash
 > joins. I'm afraid we are unable to eliminate them in the plan.
 > Instead, we will eliminate them in runtime in the code link you
 > pasted, ie. HdfsScanNodeBase::PartitionPassesFilters(). Did I miss
 > something?
 >
 > > Personally, I feel we should allow the target to be a partition
 > column in this patch to pick up good performance gain, especially
 > for large tables with hundreds of partitions. The code to deal with
 > partition column is here: 
 > https://github.com/apache/impala/blob/master/be/src/exec/hdfs-scan-node-base.cc#L922.
 > Seems your code will work out of box in this situation if line
 > @1221 is removed.
 >
 > UpdateSearchArgumentWithFilters() is only used in the orc scanner
 > to push down filters into the ORC lib. We need line 1221 since
 > partition columns don't exist in the ORC files.
 >
 > The logics of HdfsScanNodeBase::PartitionPassesFilters() still
 > apply on IN-list filters. I don't see it skip using IN-list
 > filters. So we already support it that filtering out unrelated
 > partitions by the IN-list filters. Or did I miss something?

Okay. I think you are right. The line at 1221 is a protection for not applying 
the filter on the data files. Sorry I missed that one.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1271
PS4, Line 1271: ataType predicate_type
> > Calling PrepareSearchArguments() for each ORC stripe may be an overkill.
It seems to me starting filtering without waiting for the merge version to 
arrive can produce incorrect/non-deterministic results. For example, assume 
values [1, 2, 10] in the first stripe, and the merged filter is [1, 2].  If a 
partial filter [2] arrives and is applied, then [1, 10] will be eliminated. 
However [1] is the answer.

Since all filter predicates are conjunctive, it is okay to use a subset of it, 
which may reduce the filtering efficiency. But the result is still correct. 
Each filter must be the merged version though.


http://gerrit.cloudera.org:8080/#/c/18141/8/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/18141/8/common/thrift/ImpalaService.thrift@725
PS8, Line 725: RUNTIME_IN_LIST_FILTER_ENTRY_LIMIT
nit. IN_LIST_FILTER_ENTRY_LIMIT?


http://gerrit.cloudera.org:8080/#/c/18141/8/common/thrift/Query.thrift
File common/thrift/Query.thrift:

http://gerrit.cloudera.org:8080/#/c/18141/8/common/thrift/Query.thrift@578
PS8, Line 578: runtime_in_list_filter_entry_limit
nit. in_list_filter_entry_limit?


http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@394
PS4, Line 394: r
> The above casting is handled in BE in the orc scanner, because the underlyi
Okay. Agree casting in BE is the right way to go if data types in orc file can 
be different from table schema.

But doing a feasibility check here for the inner should be done for the reasons 
mentioned.


http://gerrit.cloudera.org:8080/#/c/18141/8/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/18141/8/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@689
PS8, Line 689: 8
I wonder if this can be improved a little bit, especially for int type, to save 
some spaces.

It is impossible for a column in ORC data file to contain 8-byte integer while 
the column type is 4-byte int, right?


http://gerrit.cloudera.org:8080/#/c/18141/8/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
File testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test:

http://gerrit.cloudera.org:8080/#/c/18141/8/testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test@8
PS8, Line 8: 3.42K
Saw quite many changes on cardinality. Can you explain the reason?


http://gerrit.cloudera.org:8080/#/c/18141/8/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
File testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test:

http://gerrit.cloudera.org:8080/#/c/18141/8/testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test@13
PS8, Line 13: HDFS
Is this change introduced in this patch?



--
To view, visit http://gerrit.cloudera.org:8080/18141
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <huangquanl...@gmail.com>
Gerrit-Comment-Date: Wed, 16 Feb 2022 17:01:53 +0000
Gerrit-HasComments: Yes

Reply via email to