[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 10:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10184/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18141
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Fri, 18 Feb 2022 12:15:21 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-18 Thread Quanlong Huang (Code Review)
Hello Qifan Chen, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/18141

to look at the new patch set (#10).

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..

WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

ORC files have optional bloom filter indexes for each column. Since
ORC-1.7.0, the C++ reader supports pushing down predicates to skip
unrelated RowGroups. The pushed-down predicates are evaluated on
file indexes (i.e. statistics and bloom filter indexes). Note that only
EQUALS and IN-list predicates can leverage bloom filter indexes.
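
For illustration only (this is not code from the patch or from the ORC library): a bloom filter can answer "might this exact value be present?" but not range questions, which is why only point predicates such as EQUALS and IN-list can use it. The sketch below assumes a toy bloom filter and a hypothetical per-row-group index holding min/max statistics plus that bloom filter; all names are made up.

```python
import hashlib


class TinyBloom:
    """Toy bloom filter; a stand-in for a per-row-group bloom filter index."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in one Python int

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_row_group_index(values):
    """Collects min/max statistics and a bloom filter for one row group."""
    bloom = TinyBloom()
    for v in values:
        bloom.add(v)
    return {"min": min(values), "max": max(values), "bloom": bloom}


def row_group_may_match(index, in_list):
    """Returns False only when no IN-list value can be in the row group."""
    return any(
        index["min"] <= v <= index["max"] and index["bloom"].might_contain(v)
        for v in in_list
    )


index = build_row_group_index([10, 500, 1000])
print(row_group_may_match(index, [10, 7]))    # 10 is present, so the row group is read
print(row_group_may_match(index, [5, 2000]))  # both values fall outside [min, max]
```

A row group is skipped only when every IN-list value is ruled out, so a short IN-list gives the index more chances to prune.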

Currently Impala has two kinds of runtime filters: bloom filter and
min-max filter. Unfortunately they can't be converted into EQUALS or
IN-list predicates. So they can't leverage the file level bloom filter
indexes.

This patch adds runtime IN-list filters for this purpose. Currently they
are generated only for a small build side (e.g. #rows <= 1024) of a
broadcast join. They will only be applied on ORC tables and be pushed
down to the ORC reader (i.e. ORC lib). To avoid exploding the IN-list,
if #rows of the build side exceeds the threshold (1024), we set the
filter to ALWAYS_TRUE. The threshold can be configured by a new query
option, RUNTIME_IN_LIST_FILTER_ENTRY_LIMIT.
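
A minimal sketch of the threshold behaviour described above, not the patch's implementation. It assumes the limit is counted in distinct inserted values (the text above counts build-side rows), and the class and method names are hypothetical.

```python
class InListFilter:
    """Collects build-side values until a limit, then degrades to ALWAYS_TRUE."""

    def __init__(self, entry_limit=1024):
        self.entry_limit = entry_limit
        self.values = set()
        self.always_true = False

    def insert(self, value):
        if self.always_true:
            return  # already degraded; nothing more to track
        self.values.add(value)
        if len(self.values) > self.entry_limit:
            # Too many entries: stop filtering instead of growing the IN-list.
            self.always_true = True
            self.values.clear()

    def find(self, value):
        # An ALWAYS_TRUE filter passes every probe-side value.
        return self.always_true or value in self.values


small = InListFilter(entry_limit=3)
for v in (1, 2, 3):
    small.insert(v)
print(small.find(2), small.find(9))  # the filter is still selective

small.insert(4)                      # exceeds the limit
print(small.always_true, small.find(9))
```

Degrading to ALWAYS_TRUE keeps the result correct: the filter simply stops rejecting anything once the list would become too large to be useful.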

Example query that will benefit from this patch:
  use tpch_orc_def;
  select count(*) from lineitem_bf join (
      select * from partsupp, part
      where ps_partkey = p_partkey and p_size = 15
        and p_type like '%BRASS' and ps_availqty < 10) v
    on l_partkey = ps_partkey and l_suppkey = ps_suppkey;

The inline-view populates a runtime IN-list filter with 4 items. Note that
we need to re-generate the lineitem table with bloom filter indexes enabled
(e.g. setting orc.bloom.filter.columns to
"l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity" in
tblproperties before inserting the data), so the runtime IN-list filter
can have a better filter rate.

Evaluating runtime IN-list filters is much slower than evaluating
runtime bloom filters due to the current simple implementation (i.e.
std::unordered_set). So we disable them at the row level.

For visibility, this patch adds two counters to the HdfsScanNode:
 - NumPushedDownPredicates
 - NumPushedDownRuntimeFilters
They reflect the predicates and runtime filters that are pushed down to
the ORC reader.

Ran perf tests on a 3-instance cluster on my desktop using TPC-DS at
scale factor 20. They show significant improvements in some queries:

+-----------+-------------+--------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+
| Workload  | Query       | File Format        | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval   |
+-----------+-------------+--------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+--------+
| TPCDS(20) | TPCDS-Q67A  | orc / snap / block | 35.07  | 44.01       | I -20.32%  | 0.38%      | 1.38%          | 10    | I -25.69%      | -3.58   | -45.33 |
| TPCDS(20) | TPCDS-Q37   | orc / snap / block | 1.08   | 1.45        | I -25.23%  | 7.14%      | 3.09%          | 10    | I -34.09%      | -3.58   | -12.94 |
| TPCDS(20) | TPCDS-Q70A  | orc / snap / block | 6.30   | 8.60        | I -26.81%  | 5.24%      | 4.21%          | 10    | I -36.67%      | -3.58   | -14.88 |
| TPCDS(20) | TPCDS-Q16   | orc / snap / block | 1.33   | 1.85        | I -28.28%  | 4.98%      | 5.92%          | 10    | I -39.38%      | -3.58   | -12.93 |
| TPCDS(20) | TPCDS-Q18A  | orc / snap / block | 5.70   | 8.06        | I -29.25%  | 3.00%      | 4.12%          | 10    | I -40.30%      | -3.58   | -19.95 |
| TPCDS(20) | TPCDS-Q22A  | orc / snap / block | 2.01   | 2.97        | I -32.21%  | 6.12%      | 5.94%          | 10    | I -47.68%      | -3.58   | -14.05 |
| TPCDS(20) | TPCDS-Q77A  | orc / snap / block | 8.49   | 12.44       | I -31.75%  | 6.44%      | 3.96%          | 10    | I -49.71%      | -3.58   | -16.97 |
| TPCDS(20) | TPCDS-Q75   | orc / snap / block | 7.76   | 12.27       | I -36.76%  | 5.01%      | 3.87%          | 10    | I -59.56%      | -3.58   | -23.26 |
| TPCDS(20) | TPCDS-Q21   | orc / snap / block | 0.71   | 1.27        | I -44.26%  | 4.56%      | 4.24%          | 10    | I -77.31%      | -3.58   | -28.31 |
| TPCDS(20) | TPCDS-Q80A  | orc / snap / block | 9.24   | 20.42       | I -54.77%  | 4.03%      | 3.82%          | 10    | I -123.12%     | -3.58   | -40.90 |
| TPCDS(20) | TPCDS-Q39-1 | orc / snap / block | 1.07   | 2.26        | I -52.74%  | * 23.83% * | 2.60%          | 10    | I -149.68%     | -3.58   | -14.43 |
| TPCDS(20) | TPCDS-Q39-2 | orc / snap / block | 1.00   | 2.33        | I -56.95%  | * 19.53% * | 2.07%          | 10    | I -151.89%

[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 10:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18141/10/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18141/10/be/src/exec/hdfs-orc-scanner.cc@318
PS10, Line 318:   ADD_COUNTER(scan_node_->runtime_profile(), 
"NumPushedDownRuntimeFilters", TUnit::UNIT);
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/18141/10/tests/query_test/test_runtime_filters.py
File tests/query_test/test_runtime_filters.py:

http://gerrit.cloudera.org:8080/#/c/18141/10/tests/query_test/test_runtime_filters.py@70
PS10, Line 70: [
flake8: E131 continuation line unaligned for hanging indent



--
Gerrit-PatchSet: 10
Gerrit-Comment-Date: Fri, 18 Feb 2022 11:53:21 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 9:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/10181/ : Initial code 
review checks failed. See linked job for details on the failure.


--
Gerrit-PatchSet: 9
Gerrit-Comment-Date: Fri, 18 Feb 2022 06:47:20 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-17 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 9:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18141/9/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18141/9/be/src/exec/hdfs-orc-scanner.cc@318
PS9, Line 318:   ADD_COUNTER(scan_node_->runtime_profile(), 
"NumPushedDownRuntimeFilters", TUnit::UNIT);
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/18141/9/tests/query_test/test_runtime_filters.py
File tests/query_test/test_runtime_filters.py:

http://gerrit.cloudera.org:8080/#/c/18141/9/tests/query_test/test_runtime_filters.py@70
PS9, Line 70: [
flake8: E131 continuation line unaligned for hanging indent



--
Gerrit-PatchSet: 9
Gerrit-Comment-Date: Fri, 18 Feb 2022 06:35:45 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-17 Thread Quanlong Huang (Code Review)
Hello Qifan Chen, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/18141

to look at the new patch set (#9).

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-16 Thread Qifan Chen (Code Review)
Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 8:

(8 comments)

Replied to and added some more.

Can you please also point out the explain output with in-list filters? Love to 
see them.

It is unfortunate that there is a massive number of filter ID changes due to 
the introduction of the in-list type. I think some day we should re-assign the 
IDs at the end of compilation so that they are consecutive.

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1221
PS4, Line 1221:
> Sorry that I'm not quite understand these.
 >
 > > I was originally thinking that when the target of a IN-list
 > filter is partition columns, then the target can be removed in FE.
 > > Doing the test here means such targets are retained in the plan
 > and do not contribute.
 >
 > Do you mean eliminating the partitions in FE? The IN-list filters
 > are generated in runtime based on the build side data of hash
 > joins. I'm afraid we are unable to eliminate them in the plan.
 > Instead, we will eliminate them in runtime in the code link you
 > pasted, ie. HdfsScanNodeBase::PartitionPassesFilters(). Did I miss
 > something?
 >
 > > Personally, I feel we should allow the target to be a partition
 > column in this patch to pick up good performance gain, especially
 > for large tables with hundreds of partitions. The code to deal with
 > partition column is here: 
 > https://github.com/apache/impala/blob/master/be/src/exec/hdfs-scan-node-base.cc#L922.
 > Seems your code will work out of box in this situation if line
 > @1221 is removed.
 >
 > UpdateSearchArgumentWithFilters() is only used in the orc scanner
 > to push down filters into the ORC lib. We need line 1221 since
 > partition columns don't exist in the ORC files.
 >
 > The logics of HdfsScanNodeBase::PartitionPassesFilters() still
 > apply on IN-list filters. I don't see it skip using IN-list
 > filters. So we already support it that filtering out unrelated
 > partitions by the IN-list filters. Or did I miss something?

Okay. I think you are right. The line at 1221 is a protection for not applying 
the filter on the data files. Sorry I missed that one.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1271
PS4, Line 1271: ataType predicate_type
> > Calling PrepareSearchArguments() for each ORC stripe may be an overkill.
It seems to me that starting to filter without waiting for the merged version to 
arrive can produce incorrect/non-deterministic results. For example, assume 
values [1, 2, 10] in the first stripe, and the merged filter is [1, 2]. If a 
partial filter [2] arrives and is applied, then [1, 10] will be eliminated. 
However, 1 is part of the answer.

Since all filter predicates are conjunctive, it is okay to use a subset of them, 
which may reduce the filtering efficiency. But the result is still correct. 
Each filter must be the merged version though.
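
The example above can be replayed in a few lines. This is only an illustration of the argument, with plain Python sets standing in for IN-list filters; the second half uses two made-up merged filters to show that dropping a whole filter can only keep extra rows.

```python
def apply_in_list(values, in_list):
    """Keeps only the values that appear in the IN-list."""
    return [v for v in values if v in in_list]


stripe_values = [1, 2, 10]
merged_filter = {1, 2}   # complete IN-list after merging all partial filters
partial_filter = {2}     # one partial filter that arrived early

expected = apply_in_list(stripe_values, merged_filter)     # keeps 1 and 2
too_strict = apply_in_list(stripe_values, partial_filter)  # wrongly drops 1

# A subset of complete (merged) conjunctive filters is safe:
# it can only let extra rows through, never drop a qualifying row.
rows = [(1, "a"), (2, "b"), (10, "b")]
filter_on_key = {1, 2}   # hypothetical merged filter on the first column
filter_on_tag = {"b"}    # hypothetical merged filter on the second column
with_both = [r for r in rows if r[0] in filter_on_key and r[1] in filter_on_tag]
with_one = [r for r in rows if r[0] in filter_on_key]

print(expected, too_strict)
print(set(with_both) <= set(with_one))
```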


http://gerrit.cloudera.org:8080/#/c/18141/8/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/18141/8/common/thrift/ImpalaService.thrift@725
PS8, Line 725: RUNTIME_IN_LIST_FILTER_ENTRY_LIMIT
nit. IN_LIST_FILTER_ENTRY_LIMIT?


http://gerrit.cloudera.org:8080/#/c/18141/8/common/thrift/Query.thrift
File common/thrift/Query.thrift:

http://gerrit.cloudera.org:8080/#/c/18141/8/common/thrift/Query.thrift@578
PS8, Line 578: runtime_in_list_filter_entry_limit
nit. in_list_filter_entry_limit?


http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@394
PS4, Line 394: r
> The above casting is handled in BE in the orc scanner, because the underlyi
Okay. I agree that casting in the BE is the right way to go if data types in 
the ORC file can differ from the table schema.

But a feasibility check for the inner should still be done here, for the 
reasons mentioned.


http://gerrit.cloudera.org:8080/#/c/18141/8/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/18141/8/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@689
PS8, Line 689: 8
I wonder if this can be improved a little bit, especially for int type, to save 
some space.

It is impossible for a column in ORC data file to contain 8-byte integer while 
the column type is 4-byte int, right?



[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-15 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 8:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1221
PS4, Line 1221:
Sorry, I don't quite understand these.

> I was originally thinking that when the target of a IN-list filter is 
> partition columns, then the target can be removed in FE.
> Doing the test here means such targets are retained in the plan and do not 
> contribute.

Do you mean eliminating the partitions in FE? The IN-list filters are generated 
at runtime based on the build side data of hash joins. I'm afraid we are unable 
to eliminate them in the plan. Instead, we will eliminate them at runtime in 
the code link you pasted, i.e. HdfsScanNodeBase::PartitionPassesFilters(). Did I 
miss something?

> Personally, I feel we should allow the target to be a partition column in 
> this patch to pick up good performance gain, especially for large tables with 
> hundreds of partitions. The code to deal with partition column is here: 
> https://github.com/apache/impala/blob/master/be/src/exec/hdfs-scan-node-base.cc#L922.
>  Seems your code will work out of box in this situation if line @1221 is 
> removed.

UpdateSearchArgumentWithFilters() is only used in the orc scanner to push down 
filters into the ORC lib. We need line 1221 since partition columns don't exist 
in the ORC files.

The logic of HdfsScanNodeBase::PartitionPassesFilters() still applies to IN-list 
filters; I don't see it skipping them. So we already support filtering out 
unrelated partitions with IN-list filters. Or did I miss something?
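
To make the split concrete, here is a rough sketch, assuming filters are a mapping from column name to a set of values. The helper names are hypothetical and do not mirror the Impala code: filters on partition columns prune whole partitions, and only filters on columns stored in the files are candidates for push-down.

```python
def split_filters(filters, partition_columns):
    """Separates filters on partition columns from filters on file columns."""
    on_partitions = {c: v for c, v in filters.items() if c in partition_columns}
    on_file_columns = {c: v for c, v in filters.items() if c not in partition_columns}
    return on_partitions, on_file_columns


def partition_passes(partition_values, partition_filters):
    """A partition survives only if each filtered partition key is in its IN-list."""
    return all(partition_values[col] in vals for col, vals in partition_filters.items())


filters = {"year": {2021, 2022}, "l_partkey": {3, 7}}  # hypothetical IN-list filters
on_partitions, pushed_down = split_filters(filters, partition_columns={"year"})

partitions = [{"year": 2019}, {"year": 2021}, {"year": 2022}]
kept = [p for p in partitions if partition_passes(p, on_partitions)]

print(kept)         # partitions that still need to be scanned
print(pushed_down)  # filters that can go to the file reader
```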


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1271
PS4, Line 1271: ataType predicate_type
> Calling PrepareSearchArguments() for each ORC stripe may be an overkill.

Yeah, it could be overkill if we have lots of predicates and runtime IN-list 
filters to push down. Runtime filters arrive at arbitrary times, so we need to 
call this whenever a new runtime filter arrives. I think we can improve this by 
checking the count of arrived filters in PrepareSearchArguments() and returning 
early if no new IN-list filters have arrived.
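
A rough sketch of that idea, not the actual scanner code: remember how many filters had arrived at the last rebuild and reuse the cached search argument when the count is unchanged. It assumes arrived filters only accumulate; the class and the builder callback are hypothetical.

```python
class SearchArgumentCache:
    """Rebuilds the pushed-down search argument only when new filters arrive."""

    def __init__(self, build_fn):
        self._build_fn = build_fn   # hypothetical builder callback
        self._seen_count = None     # number of filters seen at the last rebuild
        self._cached = None
        self.rebuilds = 0

    def prepare(self, arrived_filters):
        count = len(arrived_filters)
        if count == self._seen_count:
            return self._cached     # nothing new since the last call
        self._cached = self._build_fn(arrived_filters)
        self._seen_count = count
        self.rebuilds += 1
        return self._cached


def build_sarg(filters):
    # Stand-in for the real construction: one IN clause per arrived filter.
    return [f"{col} IN {sorted(vals)}" for col, vals in sorted(filters.items())]


cache = SearchArgumentCache(build_sarg)
arrived = {}
cache.prepare(arrived)         # stripe 1: first build
cache.prepare(arrived)         # stripe 2: reused
arrived["l_partkey"] = {3, 7}  # a runtime filter arrives
sarg = cache.prepare(arrived)  # stripe 3: rebuilt
cache.prepare(arrived)         # stripe 4: reused
print(cache.rebuilds, sarg)
```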

> My understanding is that there is a consolidation step to merge the filters 
> from different partitions (for PARTITIONED HJ). Only the merged filter can 
> arrive at the scan node. For BROADCAST HJ, such merge step is not needed.

Yeah, we don't have the merge step for IN-list filters. However, they can arrive 
here since the coordinator will still publish them.


http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@394
PS4, Line 394: r
> It also depends on how ORC layer handles the types.
The above casting is handled in the BE in the ORC scanner, because the 
underlying ORC files could have different schemas. We can only know the file 
schema after we parse the file footer. The casting code is in 
HdfsOrcScanner::GetSearchArgumentLiteral().

I think in the FE, we just need to make sure these types are supported in the 
BE. The BE code will cast values based on the ORC file schema, or skip using 
the filter if the cast fails.
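
As a sketch of the "cast or skip" rule (illustrative only; the type names and helpers below are assumptions and do not reflect HdfsOrcScanner::GetSearchArgumentLiteral()): every IN-list value is converted to the file column's type, and if any value cannot be converted the whole filter is left out of the push-down.

```python
INT_RANGES = {
    "int": (-2**31, 2**31 - 1),
    "bigint": (-2**63, 2**63 - 1),
}


def cast_literal(value, file_type):
    """Returns the value as a literal of file_type, or None if it does not fit."""
    if file_type in INT_RANGES:
        low, high = INT_RANGES[file_type]
        is_int = isinstance(value, int) and not isinstance(value, bool)
        return value if is_int and low <= value <= high else None
    if file_type == "string":
        return value if isinstance(value, str) else None
    return None  # type not supported for push-down in this sketch


def literals_for_pushdown(in_list_values, file_type):
    """Casts every IN-list value; returns None to signal 'skip this filter'."""
    literals = []
    for value in in_list_values:
        literal = cast_literal(value, file_type)
        if literal is None:
            return None
        literals.append(literal)
    return literals


print(literals_for_pushdown([1, 2, 3], "int"))         # every value fits
print(literals_for_pushdown([1, 2**40], "int"))        # one value does not fit
print(literals_for_pushdown([1, 2**40], "bigint"))     # fits the wider file type
```

Skipping the filter costs only pruning opportunity; the rows are still filtered later by the join itself.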

BTW, the Java implementation of the ORC lib is slightly different to its C++ 
implementation. The ORC C++ lib currently supports these types: 
https://github.com/apache/orc/blob/rel/release-1.7.0/c++/include/orc/sargs/Literal.hh#L72-L110



--
Gerrit-PatchSet: 8
Gerrit-Comment-Date: Wed, 16 Feb 2022 02:55:14 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 8:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10156/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
Gerrit-PatchSet: 8
Gerrit-Comment-Date: Tue, 15 Feb 2022 00:03:09 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-14 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 8:

> Patch Set 6:
>
> (6 comments)
>
> Thanks!

Thanks, Qifan! I'll address your comments in the next patch set.

Patch set 7 fixes the failed tests and adds two profile counters.
Patch set 8 is a rebase to fix the merge conflicts.


--
Gerrit-PatchSet: 8
Gerrit-Comment-Date: Mon, 14 Feb 2022 23:43:24 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18141/8/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18141/8/be/src/exec/hdfs-orc-scanner.cc@318
PS8, Line 318:   ADD_COUNTER(scan_node_->runtime_profile(), 
"NumPushedDownRuntimeFilters", TUnit::UNIT);
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/18141/8/tests/query_test/test_runtime_filters.py
File tests/query_test/test_runtime_filters.py:

http://gerrit.cloudera.org:8080/#/c/18141/8/tests/query_test/test_runtime_filters.py@70
PS8, Line 70: [
flake8: E131 continuation line unaligned for hanging indent



--
Gerrit-PatchSet: 8
Gerrit-Comment-Date: Mon, 14 Feb 2022 23:41:52 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10155/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
Gerrit-PatchSet: 7
Gerrit-Comment-Date: Mon, 14 Feb 2022 23:42:00 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-14 Thread Quanlong Huang (Code Review)
Hello Qifan Chen, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/18141

to look at the new patch set (#8).

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..

WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

ORC files have optional bloom filter indexes for each column. Since
ORC-1.7.0, the C++ reader supports pushing down predicates to skip
unrelated RowGroups. The pushed-down predicates are evaluated on
file indexes (i.e. statistics and bloom filter indexes). Note that only
EQUALS and IN-list predicates can leverage bloom filter indexes.

Currently Impala has two kinds of runtime filters: bloom filter and
min-max filter. Unfortunately they can't be converted into EQUALS or
IN-list predicates. So they can't leverage the file level bloom filter
indexes.

This patch adds runtime IN-list filters for this purpose. Currently they
are generated only for a small build side (e.g. #rows <= 1024) of a
broadcast join. They will only be applied on ORC tables and be pushed
down to the ORC reader (i.e. ORC lib). To avoid exploding the IN-list,
if #rows of the build side exceeds the threshold (1024), we set the
filter to ALWAYS_TRUE. The threshold can be configured by a new query
option, runtime_in_list_filter_entry_limit.

Example query that will benefit from this patch:
  use tpch_orc_def;
  select count(*) from lineitem_bf join (
      select * from partsupp, part
      where ps_partkey = p_partkey and p_size = 15
        and p_type like '%BRASS' and ps_availqty < 10) v
    on l_partkey = ps_partkey and l_suppkey = ps_suppkey;

The inline-view populates a runtime IN-list filter with 4 items. Note that
we need to re-generate the lineitem table with bloom filter indexes enabled
(e.g. setting orc.bloom.filter.columns to
"l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity" in
tblproperties before inserting the data), so the runtime IN-list filter
can have a better filter rate.

Evaluating runtime IN-list filters is much slower than evaluating
runtime bloom filters due to the current simple implementation (i.e.
std::unordered_set). So we disable them at the row level.

TODO: Codegen InListFilter::Insert() and InListFilter::Find().

For visibility, this patch adds two counters to the HdfsScanNode:
 - NumPushedDownPredicates
 - NumPushedDownRuntimeFilters
They reflect the predicates and runtime filters that are pushed down to
the ORC reader.

Tests:
 - Many planner tests have changes in the runtime filter ids.
 - TODO: Test IN-list filter with NULLs
 - TODO: Test IN-list filter on complex exprs targets
 - TODO: Test IN-list filter on all types including DATE

Change-Id: I25080628233799aa0b6be18d5a832f1385414501
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/join-builder.cc
M be/src/exec/nested-loop-join-builder.h
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-builder.h
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator-filter-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/runtime-filter-bank.cc
M be/src/runtime/runtime-filter-bank.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/runtime-filter-test.cc
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-filter.h
M be/src/runtime/runtime-filter.inline.h
M be/src/service/data-stream-service.cc
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/CMakeLists.txt
A be/src/util/in-list-filter-ir.cc
A be/src/util/in-list-filter.cc
A be/src/util/in-list-filter.h
M common/protobuf/data_stream_service.proto
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M testdata/workloads/functional-planner/queries/PlannerTest/analytic-fns.test
M testdata/workloads/functional-planner/queries/PlannerTest/bloom-filter-assignment.test
M testdata/workloads/functional-planner/queries/PlannerTest/card-inner-join.test
M testdata/workloads/functional-planner/queries/PlannerTest/card-multi-join.test
M testdata/workloads/functional-planner/queries/PlannerTest/card-outer-join.test
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-planner/queries/PlannerTest/conjunct-ordering.test
M testdata/workloads/functional-planner/queries/PlannerTest/constant-propagation.test

[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18141/7/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18141/7/be/src/exec/hdfs-orc-scanner.cc@160
PS7, Line 160:   ADD_COUNTER(scan_node_->runtime_profile(), "NumPushedDownRuntimeFilters", TUnit::UNIT);
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/18141/7/tests/query_test/test_runtime_filters.py
File tests/query_test/test_runtime_filters.py:

http://gerrit.cloudera.org:8080/#/c/18141/7/tests/query_test/test_runtime_filters.py@70
PS7, Line 70: [
flake8: E131 continuation line unaligned for hanging indent



--
To view, visit http://gerrit.cloudera.org:8080/18141
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Mon, 14 Feb 2022 23:18:00 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-14 Thread Quanlong Huang (Code Review)
Hello Qifan Chen, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/18141

to look at the new patch set (#7).

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..

WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

ORC files have optional bloom filter indexes for each column. Since
ORC-1.7.0, the C++ reader supports pushing down predicates to skip
unrelated RowGroups. The pushed down predicates will be evaluated on
file indexes (i.e. statistics and bloom filter indexes). Note that only
EQUALS and IN-list predicates can leverage bloom filter indexes.

Currently Impala has two kinds of runtime filters: bloom filter and
min-max filter. Unfortunately they can't be converted into EQUALS or
IN-list predicates. So they can't leverage the file level bloom filter
indexes.

This patch adds runtime IN-list filters for this purpose. Currently they
are generated only for small build side (e.g. #rows <= 1024) of a
broadcast join. They will only be applied on ORC tables and be pushed
down to the ORC reader (i.e. ORC lib). To avoid exploding the IN-list,
if #rows of the build side exceeds the threshold (1024), we set the
filter to ALWAYS_TRUE. The threshold can be configured by a new query
option, runtime_in_list_filter_entry_limit.
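
The threshold behaviour described above can be sketched as follows. This is only an illustration under stated assumptions: InListFilterSketch is a hypothetical stand-in, not the InListFilter class added by the patch, it handles 64-bit integers only, and it counts distinct values rather than build-side rows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Hypothetical stand-in for the patch's InListFilter; not the real Impala class.
class InListFilterSketch {
 public:
  explicit InListFilterSketch(size_t entry_limit) : entry_limit_(entry_limit) {
    assert(entry_limit > 0);
  }

  // Adds one build-side value. Once the number of distinct values exceeds the
  // limit, the filter degrades to ALWAYS_TRUE and stops storing values.
  void Insert(int64_t v) {
    if (always_true_) return;
    values_.insert(v);
    if (values_.size() > entry_limit_) {
      always_true_ = true;
      values_.clear();
    }
  }

  // Returns true if a probe-side value may match. An ALWAYS_TRUE filter
  // rejects nothing.
  bool Find(int64_t v) const { return always_true_ || values_.count(v) > 0; }

  bool AlwaysTrue() const { return always_true_; }
  size_t NumItems() const { return values_.size(); }

 private:
  const size_t entry_limit_;
  bool always_true_ = false;
  std::unordered_set<int64_t> values_;
};
```

With a limit of 1024, a build side producing 4 distinct keys keeps all 4 items, while one producing 1025 distinct keys ends up as ALWAYS_TRUE and is not worth pushing down.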

Example query that will benefit from this patch:
  use tpch_orc_def;
  select count(*) from lineitem_bf join (
      select * from partsupp, part
      where ps_partkey = p_partkey and p_size = 15
        and p_type like '%BRASS' and ps_availqty < 10) v
    on l_partkey = ps_partkey and l_suppkey = ps_suppkey;

The inline-view populates a runtime IN-list filter with 4 items. Note that
we need to re-generate the lineitem table with bloom filter indexes enabled
(e.g. setting orc.bloom.filter.columns to
"l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity" in
tblproperties before inserting the data), so the runtime IN-list filter
can have a better filter rate.

Evaluating runtime IN-list filters is much slower than evaluating
runtime bloom filters due to the current simple implementation (i.e.
std::unordered_set). So we disable it at row level.

TODO: Codegen InListFilter::Insert() and InListFilter::Find().

For visibility, this patch adds two counters in the HdfsScanNode:
 - NumPushedDownPredicates
 - NumPushedDownRuntimeFilters
They reflect the predicates and runtime filters that are pushed down to
the ORC reader.

Tests:
 - Many planner tests have changes in the runtime filter ids.
 - TODO: Test IN-list filter with NULLs
 - TODO: Test IN-list filter on complex exprs targets
 - TODO: Test IN-list filter on all types including DATE

Change-Id: I25080628233799aa0b6be18d5a832f1385414501
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/join-builder.cc
M be/src/exec/nested-loop-join-builder.h
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-builder.h
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator-filter-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/runtime-filter-bank.cc
M be/src/runtime/runtime-filter-bank.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/runtime-filter-test.cc
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-filter.h
M be/src/runtime/runtime-filter.inline.h
M be/src/service/data-stream-service.cc
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/CMakeLists.txt
A be/src/util/in-list-filter-ir.cc
A be/src/util/in-list-filter.cc
A be/src/util/in-list-filter.h
M common/protobuf/data_stream_service.proto
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M testdata/workloads/functional-planner/queries/PlannerTest/acid-scans.test
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M testdata/workloads/functional-planner/queries/PlannerTest/analytic-fns.test
M testdata/workloads/functional-planner/queries/PlannerTest/bloom-filter-assignment.test
M testdata/workloads/functional-planner/queries/PlannerTest/card-inner-join.test
M testdata/workloads/functional-planner/queries/PlannerTest/card-multi-join.test
M testdata/workloads/functional-planner/queries/PlannerTest/card-outer-join.test
M testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-planner/queries/PlannerTest/conjunct-ordering.test
M testdata/workloads/functional-planner/queries/PlannerTest/constant-propagation.test
M 

[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-14 Thread Qifan Chen (Code Review)
Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 6:

(6 comments)

Thanks!

http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@30
PS4, Line 30:
> change to "with"? It means the IN-list has 4 items.
Okay.


http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@34
PS4, Line 34:  ps_partkey and l_suppkey = ps_suppkey;
:
> You are right but not sure we have misunderstanding here. There are two kin
Good to know! Thanks for the explanation.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1221
PS4, Line 1221: f (in_list_filter->AlwaysTrue()) continue;
> Yeah, the check is done by FE: https://github.com/apache/impala/blob/6c845e
I was originally thinking that when the target of an IN-list filter is a partition 
column, the target can be removed in FE.

Doing the test here means such targets are retained in the plan but do not 
contribute.

Personally, I feel we should allow the target to be a partition column in this 
patch to pick up a good performance gain, especially for large tables with 
hundreds of partitions. The code that deals with partition columns is here: 
https://github.com/apache/impala/blob/master/be/src/exec/hdfs-scan-node-base.cc#L922.
It seems your code will work out of the box in this situation if line @1221 is 
removed.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1271
PS4, Line 1271:
> PrepareSearchArguments() will be called multiple times after this patch. Th
Okay.

Calling PrepareSearchArguments() for each ORC stripe may be overkill. My 
understanding is that there is a consolidation step to merge the filters from 
different partitions (for PARTITIONED HJ). Only the merged filter can arrive at 
the scan node. For BROADCAST HJ, such a merge step is not needed.


http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java:

http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@394
PS4, Line 394: r
> I think it's assumed that both sides are casted to the same type. EQUALS pr
It also depends on how ORC layer handles the types.

From https://orc.apache.org/api/orc-core/org/apache/orc/Reader.Options.html, 
https://orc.apache.org/api/hive-storage-api/org/apache/hadoop/hive/ql/io/sarg/SearchArgument.html?is-external=true
and 
https://orc.apache.org/api/hive-storage-api/org/apache/hadoop/hive/ql/io/sarg/PredicateLeaf.html,
it seems the literal list can only take one of four primitive-typed 
objects: Integer, Long, Double, or String. Denote such a type T. Then, 
technically, it is sufficient that both the inner and the outer, after 
optional casting, are of type T. Note also that we need to verify the 
surviving column values because IN-list predicates are mapped to ORC 
bloom filters.

The casting rules could be as follows, in order of priority:

1. If either the inner or outer is small/tiny int, cast both to int;
2. If either is less than or equal to int, cast both to int;
3. If either is less than or equal to big int, cast both to big int;
4. If either is less than or equal to double, cast both to double;
5. If either is a SQL character type, cast both to string.
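
One possible reading of the rules above, as a rough sketch. Read literally, rule 1 would narrow a DOUBLE to INT, so this sketch assumes the intent is to widen both sides to the smallest of the four literal kinds that can hold both. ColType, LiteralType and both function names are hypothetical and are not Impala or ORC types. Treating a numeric side joined with a character side as a mismatch is also an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Hypothetical column types; not Impala's actual type enum.
enum class ColType { TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, CHAR, VARCHAR, STRING };

// The four literal kinds mentioned in the comment, ordered from narrowest to widest.
enum class LiteralType { INT, LONG, DOUBLE, STRING };

// Maps one column type to the narrowest literal kind that can represent it.
LiteralType NarrowestLiteral(ColType t) {
  switch (t) {
    case ColType::TINYINT:
    case ColType::SMALLINT:
    case ColType::INT:
      return LiteralType::INT;
    case ColType::BIGINT:
      return LiteralType::LONG;
    case ColType::FLOAT:
    case ColType::DOUBLE:
      return LiteralType::DOUBLE;
    case ColType::CHAR:
    case ColType::VARCHAR:
    case ColType::STRING:
      return LiteralType::STRING;
  }
  assert(false && "unhandled ColType");
  return LiteralType::STRING;
}

// Returns the literal kind both join sides should be cast to, or nullopt when
// one side is a character type and the other is numeric (a mismatch that can
// be reported early).
std::optional<LiteralType> CommonLiteral(ColType inner, ColType outer) {
  LiteralType a = NarrowestLiteral(inner);
  LiteralType b = NarrowestLiteral(outer);
  bool a_is_string = (a == LiteralType::STRING);
  bool b_is_string = (b == LiteralType::STRING);
  if (a_is_string != b_is_string) return std::nullopt;
  return std::max(a, b);  // relies on the enum order INT < LONG < DOUBLE < STRING
}
```

Under these assumptions a TINYINT joined with a BIGINT yields LONG, and an INT joined with a VARCHAR is rejected before any filter is generated.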


I think it is a good idea to verify the types here to make it possible to 
detect type mismatch early.


http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@742
PS4, Line 742:   public int compare(RuntimeFilter a, RuntimeFilter b) {
> I think it's very likely that partitioned HJs will exceed the threshold. Bu
Sounds like a good idea to handle partitioned HJs in another JIRA.

We can borrow BE code from min/max filters to handle both 1) and 2).



--
To view, visit http://gerrit.cloudera.org:8080/18141
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Mon, 14 Feb 2022 17:04:02 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 6:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10115/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18141
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Tue, 08 Feb 2022 03:29:53 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-07 Thread Quanlong Huang (Code Review)
Hello Qifan Chen, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/18141

to look at the new patch set (#6).

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..

WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

ORC files have optional bloom filter indexes for each column. Since
ORC-1.7.0, the C++ reader supports pushing down predicates to skip
unrelated RowGroups. The pushed down predicates will be evaluated on
file indexes (i.e. statistics and bloom filter indexes). Note that only
EQUALS and IN-list predicates can leverage bloom filter indexes.

Currently Impala has two kinds of runtime filters: bloom filter and
min-max filter. Unfortunately they can't be converted into EQUALS or
IN-list predicates. So they can't leverage the file level bloom filter
indexes.

This patch adds runtime IN-list filters for this purpose. Currently they
are generated only for small build side (e.g. #rows <= 1024) of a
broadcast join. They will only be applied on ORC tables and be pushed
down to the ORC reader (i.e. ORC lib). To avoid exploding the IN-list,
if #rows of the build side exceeds the threshold (1024), we set the
filter to ALWAYS_TRUE. The threshold can be configured by a new query
option, runtime_in_list_filter_entry_limit.

Example query that will benefit from this patch:
  use tpch_orc_def;
  select count(*) from lineitem_bf join (
      select * from partsupp, part
      where ps_partkey = p_partkey and p_size = 15
        and p_type like '%BRASS' and ps_availqty < 10) v
    on l_partkey = ps_partkey and l_suppkey = ps_suppkey;

The inline-view populates a runtime IN-list filter with 4 items. Note that
we need to re-generate the lineitem table with bloom filter indexes enabled
(e.g. setting orc.bloom.filter.columns to
"l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity" in
tblproperties before inserting the data), so the runtime IN-list filter
can have a better filter rate.

Evaluating runtime IN-list filters is much slower than evaluating
runtime bloom filters due to the current simple implementation (i.e.
std::unordered_set). So we disable it at row level.

TODO: fix tests due to plan changes.

Change-Id: I25080628233799aa0b6be18d5a832f1385414501
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/join-builder.cc
M be/src/exec/nested-loop-join-builder.h
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-builder.h
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator-filter-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/runtime-filter-bank.cc
M be/src/runtime/runtime-filter-bank.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/runtime-filter-test.cc
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-filter.h
M be/src/runtime/runtime-filter.inline.h
M be/src/service/data-stream-service.cc
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/CMakeLists.txt
A be/src/util/in-list-filter-ir.cc
A be/src/util/in-list-filter.cc
A be/src/util/in-list-filter.h
M common/protobuf/data_stream_service.proto
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M tests/query_test/test_runtime_filters.py
36 files changed, 857 insertions(+), 160 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/41/18141/6
--
To view, visit http://gerrit.cloudera.org:8080/18141
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Quanlong Huang 


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18141/6/tests/query_test/test_runtime_filters.py
File tests/query_test/test_runtime_filters.py:

http://gerrit.cloudera.org:8080/#/c/18141/6/tests/query_test/test_runtime_filters.py@70
PS6, Line 70: [
flake8: E131 continuation line unaligned for hanging indent



--
To view, visit http://gerrit.cloudera.org:8080/18141
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Tue, 08 Feb 2022 03:04:50 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 5:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10113/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/18141
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Tue, 08 Feb 2022 02:53:51 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-07 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 5:

Thanks for your feedback, Qifan! Addressed the comments. I'm still 
updating/adding tests.


--
To view, visit http://gerrit.cloudera.org:8080/18141
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Tue, 08 Feb 2022 02:30:42 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-07 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 5:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@10
PS4, Line 10: ++ reader supports pushing down predicates to skip
: unrelated RowGroups. The pushed down predicates will be evaluated on
: file indexes (i.e. statistics and bloom filter indexes). Note that only
: EQUALS and IN-list predicates can leverage bloom
> May reword as
Sorry for making this unclear.

* The native ORC library can accept many kinds of predicates: not just EQUALS 
and IN-list predicates, but also comparison (e.g. <, >, >=) and IS-[NOT]-NULL 
predicates, etc. They can all be used to skip unrelated ORC RowGroups.
* Each ORC file can have optional bloom filters on different columns.
* Only EQUALS and IN-list predicates can leverage these file-level bloom 
filters.

Updated the sentences, but I'm not sure if they are clear enough.


http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@18
PS4, Line 18: indexes.
:
: This patch adds runtime IN-list filters for this
> Suggest to mention it after the introduction section. That is, right before
Done


http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@30
PS4, Line 30:
> nit. remove
change to "with"? It means the IN-list has 4 items.


http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@34
PS4, Line 34:  ps_partkey and l_suppkey = ps_suppkey;
:
> Not sure if this is right. I thought IN-list will be done inside ORC librar
You are right, but I'm not sure whether we have a misunderstanding here. There 
are two kinds of bloom filters:

* Runtime bloom filters generated by Impala
* Bloom filter indexes in the ORC files (generated by Hive when inserting into 
the table)

If the lineitem table is generated with bloom filter indexes, the runtime 
IN-list filter can have a better filter rate.
Updated the sentence.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1221
PS4, Line 1221: / Only apply runtime filters on non-partition columns.
> Looks like this can be done in FE.
Yeah, the check is done by FE: 
https://github.com/apache/impala/blob/6c845eb24b952972975126e07a36cd1565ada629/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java#L936

Here we only check the flag set by FE.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1271
PS4, Line 1271: < filter->id();
> I wonder if this method UpdateSearchArgumentWithFilters() is called only on
PrepareSearchArguments() will be called multiple times after this patch, and so 
will UpdateSearchArgumentWithFilters(). The reason is that runtime filters 
arrive at runtime, so we rebuild the SearchArgument each time we start 
reading a new ORC stripe.

However, the above situation seems impossible. Once an IN-list filter has 
arrived, it won't be updated anymore, so the predicate should remain the same.

BTW, I updated the method comment of PrepareSearchArguments(). Please let me 
know if it's unclear.
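
The per-stripe rebuild described in this reply can be pictured roughly as below. It is a sketch only: FilterStateSketch and SelectFiltersForStripe are hypothetical names and do not reflect the real PrepareSearchArguments() or UpdateSearchArgumentWithFilters() code. The assumption is that a filter is pushed down only if it has arrived and is not ALWAYS_TRUE.

```cpp
#include <cassert>
#include <vector>

// Hypothetical view of a runtime filter as seen by the scanner.
struct FilterStateSketch {
  int id;            // filter ids are assumed to be non-negative
  bool arrived;      // set once the filter has been delivered to the scan node
  bool always_true;  // an ALWAYS_TRUE filter cannot prune anything
};

// Called at the start of each stripe: returns the ids of the filters that are
// worth turning into pushed-down predicates for that stripe.
std::vector<int> SelectFiltersForStripe(const std::vector<FilterStateSketch>& filters) {
  std::vector<int> ids;
  for (const FilterStateSketch& f : filters) {
    assert(f.id >= 0);
    if (f.arrived && !f.always_true) ids.push_back(f.id);
  }
  return ids;
}
```

Because the selection is recomputed for every stripe, a filter that arrives while the scan is in progress is picked up at the next stripe, and a filter that has already arrived contributes the same predicate every time.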


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/partitioned-hash-join-builder.cc@959
PS4, Line 959: //TODO: IN-list filter threshold (default 1024).
> Sounds like this is quite important.  When the # items in the list in HJ bu
Yeah, I added this in the commit message in PS5. Also added the query option.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/coordinator.cc@599
PS4, Line 599: In-l
> In-list size?
Done


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter-ir.cc
File be/src/runtime/runtime-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter-ir.cc@40
PS4, Line 40:
: case TRuntimeFilterType::IN_LIST: {
> Seems to me IN_list will shine in performance when applied to partition col
Thanks for catching this! I thought this would only be evaluated at row level. 
I should add the skip logic in scanners.

EDIT: moved the check to HdfsScanner::EvalRuntimeFilter()


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter.inline.h
File be/src/runtime/runtime-filter.inline.h:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter.inline.h@32
PS4, Line 32: switch (filter_desc()
> Switch on filter_desc().type to save some IF tests?
Good point! Done.



[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-07 Thread Quanlong Huang (Code Review)
Hello Qifan Chen, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/18141

to look at the new patch set (#5).

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..

WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

ORC files have optional bloom filter indexes for each column. Since
ORC-1.7.0, the C++ reader supports pushing down predicates to skip
unrelated RowGroups. The pushed down predicates will be evaluated on
file indexes (i.e. statistics and bloom filter indexes). Note that only
EQUALS and IN-list predicates can leverage bloom filter indexes.

Currently Impala has two kinds of runtime filters: bloom filter and
min-max filter. Unfortunately they can't be converted into EQUALS or
IN-list predicates. So they can't leverage the file level bloom filter
indexes.

This patch adds runtime IN-list filters for this purpose. Currently they
are generated only for small build side (e.g. #rows <= 1024) of a
broadcast join. They will only be applied on ORC tables and be pushed
down to the ORC reader (i.e. ORC lib). To avoid exploding the IN-list,
if #rows of the build side exceeds the threshold (1024), we set the
filter to ALWAYS_TRUE. The threshold can be configured by a new query
option, runtime_in_list_filter_entry_limit.

Example query that will benefit from this patch:
  use tpch_orc_def;
  select count(*) from lineitem_bf join (
      select * from partsupp, part
      where ps_partkey = p_partkey and p_size = 15
        and p_type like '%BRASS' and ps_availqty < 10) v
    on l_partkey = ps_partkey and l_suppkey = ps_suppkey;

The inline-view populates a runtime IN-list filter with 4 items. Note that
we need to re-generate the lineitem table with bloom filter indexes enabled
(e.g. setting orc.bloom.filter.columns to
"l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity" in
tblproperties before inserting the data), so the runtime IN-list filter
can have a better filter rate.

Evaluating runtime IN-list filters is much slower than evaluating
runtime bloom filters due to the current simple implementation (i.e.
std::unordered_set). So we disable it at row level.

TODO: fix tests due to plan changes.

Change-Id: I25080628233799aa0b6be18d5a832f1385414501
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/join-builder.cc
M be/src/exec/nested-loop-join-builder.h
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-builder.h
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator-filter-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/runtime-filter-bank.cc
M be/src/runtime/runtime-filter-bank.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/runtime-filter-test.cc
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-filter.h
M be/src/runtime/runtime-filter.inline.h
M be/src/service/data-stream-service.cc
M be/src/service/query-options-test.cc
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/CMakeLists.txt
A be/src/util/in-list-filter-ir.cc
A be/src/util/in-list-filter.cc
A be/src/util/in-list-filter.h
M common/protobuf/data_stream_service.proto
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M tests/query_test/test_runtime_filters.py
36 files changed, 857 insertions(+), 160 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/41/18141/5
--
To view, visit http://gerrit.cloudera.org:8080/18141
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-07 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18141/5/tests/query_test/test_runtime_filters.py
File tests/query_test/test_runtime_filters.py:

http://gerrit.cloudera.org:8080/#/c/18141/5/tests/query_test/test_runtime_filters.py@70
PS5, Line 70: [
flake8: E131 continuation line unaligned for hanging indent



--
To view, visit http://gerrit.cloudera.org:8080/18141
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Tue, 08 Feb 2022 02:30:52 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-04 Thread Qifan Chen (Code Review)
Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 4:

(16 comments)

Looks good to me!

http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@10
PS4, Line 10: Unfortunately they can't leverage the bloom filters in
: ORC files. Because only EQUALS and IN-list predicates can leverage them
: to skip unrelated ORC RowGroups, and we can't convert runtime bloom
: filters or min-max filters into such predicates.
May reword as

Unfortunately the native ORC library can only accept EQUALS and IN-list 
predicates to skip unrelated ORC RowGroups, and neither runtime bloom filters 
nor min-max filters can be converted to them.


http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@18
PS4, Line 18: Evaluating runtime IN-list filters is much slower than evaluating
: runtime bloom filters due to the current simple implementation (i.e.
: std::unordered_set). So we disable it at row level.
Suggest mentioning this after the introduction section, that is, right before
the TODO.


http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@30
PS4, Line 30: of
nit. remove


http://gerrit.cloudera.org:8080/#/c/18141/4//COMMIT_MSG@34
PS4, Line 34: so the pushed down IN-list filter can have a better
: filter rate.
Not sure if this is right. I thought the IN-list filter is evaluated inside the
ORC library layer and the bloom filter in the Impala layer.

Maybe say: Note that IN-list filters and bloom filters are orthogonal because
they are applied at different layers, so it is desirable to keep the bloom
filters in the query plan.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1221
PS4, Line 1221: f (filter->IsBoundByPartitionColumn(GetScanNodeId())) continue;
Looks like this can be done in FE.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/hdfs-orc-scanner.cc@1271
PS4, Line 1271: PrepareInListPredicate
I wonder whether this method, UpdateSearchArgumentWithFilters(), is called only
once. PrepareInListPredicate() can put the ORC predicate in two forms, so if
this method is called multiple times, we could end up with the following
interesting situation:

1. List of 1 item -> EQUALS form;
2. List of 4 items -> IN-LIST form;

The final form should be the IN-LIST form including the item from 1.
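
A minimal sketch of the situation above, assuming values from successive calls should be unioned and the form re-derived from the merged set. `PredicateSketch`, `ChooseForm` and `MergeValues` are hypothetical names for illustration. They are not the ORC SearchArgument API and not code from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical stand-in for a pushed-down predicate on one column.
struct PredicateSketch {
  enum class Form { NONE, EQUALS, IN_LIST };
  Form form = Form::NONE;
  std::set<int64_t> values;
};

// Picks the predicate form from the number of distinct values.
inline PredicateSketch::Form ChooseForm(const std::set<int64_t>& values) {
  if (values.empty()) return PredicateSketch::Form::NONE;
  return values.size() == 1 ? PredicateSketch::Form::EQUALS
                            : PredicateSketch::Form::IN_LIST;
}

// Unions newly arrived values into the predicate, then re-derives the form,
// so an earlier EQUALS item is kept when the list later grows.
inline void MergeValues(PredicateSketch* pred,
                        const std::set<int64_t>& incoming) {
  assert(pred != nullptr);
  pred->values.insert(incoming.begin(), incoming.end());
  pred->form = ChooseForm(pred->values);
}
```

With this shape, a first call with one item yields the EQUALS form, and a second call with four new items yields an IN-LIST of five values that still includes the first item.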


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/exec/partitioned-hash-join-builder.cc@959
PS4, Line 959: //TODO: IN-list filter threshold (default 1024).
Sounds like this is quite important. When the number of items in the list in
the HJ builder exceeds the threshold, we set the filter to ALWAYS TRUE.
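
A minimal sketch of the threshold behaviour described above, for illustration only. `InListFilterSketch` is a hypothetical class, not the InListFilter added by the patch, and the limit is passed in by the caller rather than read from any real option.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Hypothetical IN-list filter that degrades to "always true" once it holds
// more distinct values than the configured limit.
class InListFilterSketch {
 public:
  explicit InListFilterSketch(size_t max_entries) : max_entries_(max_entries) {
    assert(max_entries > 0);
  }

  void Insert(int64_t v) {
    if (always_true_) return;  // Already degraded, nothing to track.
    values_.insert(v);
    if (values_.size() > max_entries_) {
      // Too many distinct values: stop filtering and release the set.
      always_true_ = true;
      values_.clear();
    }
  }

  bool AlwaysTrue() const { return always_true_; }

  // An always-true filter accepts every value, so it never drops a row.
  bool MayContain(int64_t v) const {
    return always_true_ || values_.count(v) > 0;
  }

 private:
  const size_t max_entries_;
  bool always_true_ = false;
  std::unordered_set<int64_t> values_;
};
```

Degrading to always-true keeps the filter correct (it never rejects a matching row) while bounding its memory use.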


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/coordinator.cc@599
PS4, Line 599: List
In-list size?


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter-ir.cc
File be/src/runtime/runtime-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter-ir.cc@40
PS4, Line 40: Evaluating IN-list filter is much slower than evaluating the corresponding bloom
: // filter. Skip it until we improve its performance.
Seems to me IN-list will shine in performance when applied to partition columns.


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter.inline.h
File be/src/runtime/runtime-filter.inline.h:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter.inline.h@32
PS4, Line 32: if (is_bloom_filter()
Switch on filter_desc().type to save some IF tests?


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/runtime/runtime-filter.inline.h@43
PS4, Line 43: if (is_bloom_filter()
Switch on filter_desc().type to save some IF tests?
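
A minimal sketch of the suggested dispatch, assuming a filter carries exactly one type tag. `FilterTypeSketch` and `RuntimeFilterSketch` are hypothetical names. The real RuntimeFilter class and its filter_desc() are not reproduced here.

```cpp
#include <cassert>

// Hypothetical type tag, standing in for the type stored in the filter desc.
enum class FilterTypeSketch { BLOOM, MIN_MAX, IN_LIST };

struct RuntimeFilterSketch {
  FilterTypeSketch type = FilterTypeSketch::BLOOM;
  bool bloom_always_true = false;
  bool min_max_always_true = false;
  bool in_list_always_true = false;

  // One switch on the tag instead of a chain of is_xxx_filter() tests.
  bool AlwaysTrue() const {
    switch (type) {
      case FilterTypeSketch::BLOOM:
        return bloom_always_true;
      case FilterTypeSketch::MIN_MAX:
        return min_max_always_true;
      case FilterTypeSketch::IN_LIST:
        return in_list_always_true;
    }
    assert(false && "unhandled filter type");
    return false;
  }
};
```

Leaving out a `default:` label lets most compilers warn when a new enum value is added but not handled.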


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/util/in-list-filter-ir.cc
File be/src/util/in-list-filter-ir.cc:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/util/in-list-filter-ir.cc@30
PS4, Line 30: 1024
Turn this into a query option?


http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/util/in-list-filter.h
File be/src/util/in-list-filter.h:

http://gerrit.cloudera.org:8080/#/c/18141/4/be/src/util/in-list-filter.h@89
PS4, Line 89: int64_t
May consider the exact type (int8_t, int16_t, int32_t or int64_t), similar to 
min/max filters, to save memory space.
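
A minimal sketch of storing values at their exact width, for illustration only. `TypedInListSketch` is a hypothetical template, not the class in in-list-filter.h. Note that with std::unordered_set the per-node overhead may dominate, so the actual saving depends on the container used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <unordered_set>

// Hypothetical IN-list storage parameterized on the column's integer type
// (int8_t, int16_t, int32_t or int64_t).
template <typename T>
class TypedInListSketch {
 public:
  // Bytes used per stored value, excluding container overhead.
  static constexpr size_t ValueBytes() { return sizeof(T); }

  static bool InRange(int64_t v) {
    return v >= std::numeric_limits<T>::min() &&
           v <= std::numeric_limits<T>::max();
  }

  // Build-side values come from a column of type T, so they must fit in T.
  void Insert(int64_t v) {
    assert(InRange(v));
    values_.insert(static_cast<T>(v));
  }

  // A probe value outside T's range cannot have been inserted.
  bool Find(int64_t v) const {
    return InRange(v) && values_.count(static_cast<T>(v)) > 0;
  }

 private:
  std::unordered_set<T> values_;
};
```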


http://gerrit.cloudera.org:8080/#/c/18141/4/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
File 

[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 4:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10094/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Fri, 04 Feb 2022 00:02:42 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-03 Thread Quanlong Huang (Code Review)
Hello Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/18141

to look at the new patch set (#4).

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..

WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

Currently Impala has two kinds of runtime filters: bloom filter and
min-max filter. Unfortunately they can't leverage the bloom filters in
ORC files, because only EQUALS and IN-list predicates can leverage them
to skip unrelated ORC RowGroups, and we can't convert runtime bloom
filters or min-max filters into such predicates.

This patch adds runtime IN-list filters for small build side (e.g. #rows
<= 1024) of a broadcast join. Currently the IN-list filters will only
apply to ORC tables and be pushed down to the ORC reader (i.e. ORC lib).
Evaluating runtime IN-list filters is much slower than evaluating
runtime bloom filters due to the current simple implementation (i.e.
std::unordered_set). So we disable it at row level.

Example query that will benefit from this patch:
  use tpch_orc_def;
  select count(*) from lineitem_bf join (
      select * from partsupp, part
      where ps_partkey = p_partkey and p_size = 15
        and p_type like '%BRASS' and ps_availqty < 10) v
    on l_partkey = ps_partkey and l_suppkey = ps_suppkey;

The inline-view populates a runtime IN-list filter of 4 items. Note that
we need to re-generate the lineitem table with bloom filters enabled
(e.g. setting orc.bloom.filter.columns to
"l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity" in
tblproperties), so the pushed down IN-list filter can have a better
filter rate.

TODO: fix tests due to plan changes.

Change-Id: I25080628233799aa0b6be18d5a832f1385414501
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/join-builder.cc
M be/src/exec/nested-loop-join-builder.h
M be/src/exec/orc-metadata-utils.cc
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-builder.h
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator-filter-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/runtime-filter-bank.cc
M be/src/runtime/runtime-filter-bank.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/runtime-filter-test.cc
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-filter.h
M be/src/runtime/runtime-filter.inline.h
M be/src/service/data-stream-service.cc
M be/src/service/query-options-test.cc
M be/src/util/CMakeLists.txt
A be/src/util/in-list-filter-ir.cc
A be/src/util/in-list-filter.cc
A be/src/util/in-list-filter.h
M common/protobuf/data_stream_service.proto
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M tests/query_test/test_runtime_filters.py
31 files changed, 748 insertions(+), 122 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/41/18141/4

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/10085/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Wed, 02 Feb 2022 09:16:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-02-02 Thread Quanlong Huang (Code Review)
Hello Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/18141

to look at the new patch set (#2).

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..

WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

Currently Impala has two kinds of runtime filters: bloom filter and
min-max filter. Unfortunately they can't leverage the bloom filters in
ORC files, because only EQUALS and IN-list predicates can leverage them
to skip unrelated ORC RowGroups, and we can't convert runtime bloom
filters or min-max filters into such predicates.

This patch adds runtime IN-list filters for small build side (e.g. #rows
<= 1024) of a broadcast join. Currently the IN-list filters will only
apply to ORC tables and be pushed down to the ORC reader (i.e. ORC lib).
Evaluating runtime IN-list filters is much slower than evaluating
runtime bloom filters due to the current simple implementation (i.e.
std::unordered_set). So we disable it at row level.

Example query that will benefit from this patch:
  use tpch_orc_def;
  select count(*) from lineitem_bf join (
      select * from partsupp, part
      where ps_partkey = p_partkey and p_size = 15
        and p_type like '%BRASS' and ps_availqty < 10) v
    on l_partkey = ps_partkey and l_suppkey = ps_suppkey;

The inline-view populates a runtime IN-list filter of 4 items. Note that
we need to re-generate the lineitem table with bloom filters enabled
(e.g. setting orc.bloom.filter.columns to
"l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity" in
tblproperties), so the pushed down IN-list filter can have a better
filter rate.

TODO: fix tests due to plan changes.

Change-Id: I25080628233799aa0b6be18d5a832f1385414501
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/join-builder.cc
M be/src/exec/nested-loop-join-builder.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-builder.h
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator-filter-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/runtime-filter-bank.cc
M be/src/runtime/runtime-filter-bank.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/runtime-filter-test.cc
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-filter.h
M be/src/runtime/runtime-filter.inline.h
M be/src/service/data-stream-service.cc
M be/src/service/query-options-test.cc
M be/src/util/CMakeLists.txt
A be/src/util/in-list-filter-ir.cc
A be/src/util/in-list-filter.cc
A be/src/util/in-list-filter.h
M common/protobuf/data_stream_service.proto
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M tests/query_test/test_runtime_filters.py
30 files changed, 752 insertions(+), 120 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/41/18141/2

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-01-11 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18141 )

Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..


Patch Set 1:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/1/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 11 Jan 2022 10:25:42 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

2022-01-11 Thread Quanlong Huang (Code Review)
Quanlong Huang has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/18141 )


Change subject: WIP IMPALA-10898: Add runtime IN-list filters for ORC tables
..

WIP IMPALA-10898: Add runtime IN-list filters for ORC tables

Currently Impala has two kinds of runtime filters: bloom filter and
min-max filter. Unfortunately they can't leverage the bloom filters in
ORC files, because only EQUALS and IN-list predicates can leverage them
to skip unrelated ORC RowGroups, and we can't convert runtime bloom
filters or min-max filters into such predicates.

This patch adds runtime IN-list filters for small build side (e.g. #rows
<= 1024) of a broadcast join. Currently the IN-list filters will only
apply to ORC tables and be pushed down to the ORC reader (i.e. ORC lib).
Evaluating runtime IN-list filters is much slower than evaluating
runtime bloom filters due to the current simple implementation (i.e.
std::unordered_set). So we disable it at row level.

Example query that will benefit from this patch:
  use tpch_orc_def;
  select count(*) from lineitem_bf join (
      select * from partsupp, part
      where ps_partkey = p_partkey and p_size = 15
        and p_type like '%BRASS' and ps_availqty < 10) v
    on l_partkey = ps_partkey and l_suppkey = ps_suppkey;

The inline-view populates a runtime IN-list filter of 4 items. Note that
we need to re-generate the lineitem table with bloom filters enabled
(e.g. setting orc.bloom.filter.columns to
"l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity" in
tblproperties), so the pushed down IN-list filter can have a better
filter rate.

TODO: fix tests due to plan changes.

Change-Id: I25080628233799aa0b6be18d5a832f1385414501
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/join-builder.cc
M be/src/exec/nested-loop-join-builder.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-builder.h
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator-filter-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/runtime-filter-bank.cc
M be/src/runtime/runtime-filter-bank.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/runtime-filter-test.cc
M be/src/runtime/runtime-filter.cc
M be/src/runtime/runtime-filter.h
M be/src/runtime/runtime-filter.inline.h
M be/src/service/data-stream-service.cc
M be/src/service/query-options-test.cc
M be/src/util/CMakeLists.txt
A be/src/util/in-list-filter-ir.cc
A be/src/util/in-list-filter.cc
A be/src/util/in-list-filter.h
M common/protobuf/data_stream_service.proto
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M tests/query_test/test_runtime_filters.py
30 files changed, 750 insertions(+), 120 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/41/18141/1

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I25080628233799aa0b6be18d5a832f1385414501
Gerrit-Change-Number: 18141
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang