[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-26 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..


Patch Set 43:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7910/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16720
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 43
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sun, 27 Dec 2020 05:56:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-26 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..


Patch Set 42:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7909/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16720

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 42
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sun, 27 Dec 2020 05:49:20 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-26 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#43). ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..

IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on 
equi-join predicate

This patch adds a new class of predicates, called overlap predicates,
that determine whether a Parquet row group or page overlaps with a
range computed from an equi hash join. If it does not, the entire row
group or page is skipped. An overlap predicate is implemented as a
min/max filter.

For the following query, the min and max in such a min/max filter are
computed from the values of the join column of table 'b' and become
fully available once the entire hash table is built. To evaluate the
overlap predicate, these two values are compared against the min/max
of each row group or page at the scan node for 'a'.

  select straight_join count(*)
  from lineitem_sorted_l_shipdate a join [SHUFFLE]
   lineitem_sorted_l_shipdate b
  where a.l_shipdate = b.l_receiptdate
  and b.l_commitdate = "1992-01-31";
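
As an illustration only, the overlap check described above can be sketched
in Python as follows. The Range class and the overlaps() and pages_to_read()
helpers are hypothetical names chosen for this sketch, not Impala's
implementation (which lives in the C++ backend), and the date ranges are
made-up values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Range:
    """Closed interval [lo, hi] of column values."""
    lo: str
    hi: str


def overlaps(filter_range: Range, stats: Range) -> bool:
    """Return True if the two closed intervals share at least one value."""
    return filter_range.lo <= stats.hi and stats.lo <= filter_range.hi


def pages_to_read(filter_range: Range, page_stats: List[Range]) -> List[int]:
    """Return indexes of pages whose min/max statistics overlap the filter."""
    return [i for i, s in enumerate(page_stats) if overlaps(filter_range, s)]


# Hypothetical min/max collected from the build side (table 'b').
# ISO-8601 date strings compare correctly as plain strings.
build_range = Range("1992-02-01", "1992-03-15")

# Hypothetical per-page min/max statistics for a sorted date column in 'a'.
pages = [
    Range("1992-01-02", "1992-01-31"),
    Range("1992-02-01", "1992-02-29"),
    Range("1992-03-01", "1992-03-31"),
    Range("1992-04-01", "1992-04-30"),
]

kept = pages_to_read(build_range, pages)
print(kept)  # [1, 2]
```

Pages 0 and 3 lie entirely outside the build-side range, so a scan using
this check would not need to read them.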

An overlap predicate for a join column of type J (in the hash table)
and a scan column of type S is formed when one of the following is true:
   Both J and S are booleans
   Both J and S are integers (tinyint, smallint, int, or bigint)
   Both J and S are approximate numerics (float or double)
   Both J and S are decimals with the same precision and scale
   Both J and S are strings (STRING, CHAR or VARCHAR)
   Both J and S are dates
   Both J and S are timestamps
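
The type rules above can be expressed as a small compatibility check. This
is a sketch under assumptions: ColType, TYPE_FAMILIES and
can_form_overlap_predicate() are hypothetical names for illustration and do
not correspond to the planner code in this patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColType:
    """Hypothetical column type: a name plus optional decimal precision/scale."""
    name: str
    precision: Optional[int] = None
    scale: Optional[int] = None


# Each set groups types that may be compared with each other, per the list above.
TYPE_FAMILIES = [
    {"BOOLEAN"},
    {"TINYINT", "SMALLINT", "INT", "BIGINT"},
    {"FLOAT", "DOUBLE"},
    {"STRING", "CHAR", "VARCHAR"},
    {"DATE"},
    {"TIMESTAMP"},
]


def can_form_overlap_predicate(j: ColType, s: ColType) -> bool:
    """Return True if join type J and scan type S satisfy one of the rules."""
    if j.name == "DECIMAL" and s.name == "DECIMAL":
        # Decimals additionally require identical precision and scale.
        return (j.precision, j.scale) == (s.precision, s.scale)
    return any(j.name in fam and s.name in fam for fam in TYPE_FAMILIES)


print(can_form_overlap_predicate(ColType("INT"), ColType("BIGINT")))
print(can_form_overlap_predicate(ColType("DECIMAL", 10, 2), ColType("DECIMAL", 12, 2)))
```

Under these rules INT and BIGINT are compatible, while two decimals that
differ in precision are not.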

As with existing min/max filters, the MAX_NUM_RUNTIME_FILTERS query
option does not apply to min/max filters created for overlap predicates.
The overlap predicates are always evaluated, after the min/max
conjuncts (if any).

Two new run-time profile counters are added to report the number of row
groups and pages, respectively, filtered out by the overlap predicates:
  1. NumRuntimeFilteredRowGroups
  2. NumRuntimeFilteredPages

Testing:
1. Unit tested on various column types with TPCH and TPCDS tables.
   Benefits were significant when the join column on the outer table
   is sorted, or when the min/max boundary values of the pages or row
   groups are monotonic;
2. Added new tests in min_max_filters.test to demonstrate the number
   of filtered out pages and row groups with the two new profile counters;
3. Added new tests in runtime-filter-propagation.test to demonstrate
   that the overlap predicates work with different column types;
4. Added data type specific overlap method tests in
   min-max-filter-test.cc;
5. Core testing.
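
Why sorted data benefits most (testing item 1) can be seen in a toy example.
This is only a sketch with made-up integer data; page_ranges() and
count_skipped() are hypothetical helpers, and the skipped count merely
mimics the kind of number a counter such as NumRuntimeFilteredPages reports.

```python
def page_ranges(values, page_size):
    """Split values into fixed-size pages and return each page's (min, max)."""
    pages = [values[i:i + page_size] for i in range(0, len(values), page_size)]
    return [(min(p), max(p)) for p in pages]


def count_skipped(ranges, lo, hi):
    """Count pages whose [min, max] does not overlap the filter range [lo, hi]."""
    return sum(1 for pmin, pmax in ranges if pmax < lo or pmin > hi)


# Sorted layout: each page covers a narrow, disjoint slice of the domain.
sorted_values = list(range(100))

# Round-robin layout: page i holds i, i+10, ..., i+90, so every page
# spans almost the whole domain.
interleaved_values = [i + 10 * k for i in range(10) for k in range(10)]

# Hypothetical build-side filter range [42, 47], ten values per page.
skipped_sorted = count_skipped(page_ranges(sorted_values, 10), 42, 47)
skipped_interleaved = count_skipped(page_ranges(interleaved_values, 10), 42, 47)
print(skipped_sorted, skipped_interleaved)  # 9 0
```

With sorted data nine of the ten pages fall outside the filter range, while
in the interleaved layout every page's min/max spans the range and nothing
can be skipped.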

TBD in this patch:
1. Performance measurement.

To do in follow-up JIRAs:
1. Apply the overlap predicate on partition columns;
2. Apply the overlap predicate on each row;
3. IR code-gen for various MinMaxFilter::EvalOverlap methods.

Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
---
M be/src/exec/exec-node.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/scan-node.cc
M be/src/runtime/coordinator.cc
M be/src/runtime/date-value.cc
M be/src/runtime/date-value.h
M be/src/runtime/runtime-filter-ir.cc
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/BinaryPredicate.java
M fe/src/main/java/org/apache/impala/analysis/Predicate.java
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/test/java/org/apache/impala/planner/PlannerTest.java
M testdata/workloads/functional-planner/queries/PlannerTest/aggregation.test
M testdata/workloads/functional-planner/queries/PlannerTest/broadcast-bytes-limit-large.test
M testdata/workloads/functional-planner/queries/PlannerTest/broadcast-bytes-limit.test
A testdata/workloads/functional-planner/queries/PlannerTest/disable-runtime-overlap-filter.test
M testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters-hdfs-num-rows-est-enabled.test
M testdata/workloads/functional-planner/queries/PlannerTest/min-max-runtime-filters.test
M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test
M testdata/workloads/functional-planner/queries/PlannerTest/nested-collections.test
M 

[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-26 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..


Patch Set 42:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16720/42/fe/src/test/java/org/apache/impala/planner/PlannerTest.java
File fe/src/test/java/org/apache/impala/planner/PlannerTest.java:

http://gerrit.cloudera.org:8080/#/c/16720/42/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@757
PS42, Line 757: options.setDisable_overlap_filter(true); // Required so 
that output doesn't vary by whether parquet tables are used or not.
line too long (127 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/42/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@787
PS42, Line 787: options.setDisable_overlap_filter(true); // Required so 
that output doesn't vary by the format of the table used.
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/16720/42/fe/src/test/java/org/apache/impala/planner/PlannerTest.java@787
PS42, Line 787: options.setDisable_overlap_filter(true); // Required so 
that output doesn't vary by the format of the table used.
line too long (118 > 90)


http://gerrit.cloudera.org:8080/#/c/16720/42/tests/run-tests.py
File tests/run-tests.py:

http://gerrit.cloudera.org:8080/#/c/16720/42/tests/run-tests.py@219
PS42, Line 219: %
flake8: E131 continuation line unaligned for hanging indent



--
To view, visit http://gerrit.cloudera.org:8080/16720

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I379405ee75b14929df7d6b5d20dabc6f51375691
Gerrit-Change-Number: 16720
Gerrit-PatchSet: 42
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Sun, 27 Dec 2020 05:28:09 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10325: Parquet scan should use min/max statistics to skip pages based on equi-join predicate

2020-12-26 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#42). ( 
http://gerrit.cloudera.org:8080/16720 )

Change subject: IMPALA-10325: Parquet scan should use min/max statistics to 
skip pages based on equi-join predicate
..


[Impala-ASF-CR] IMPALA-10406: Query with analytic functions doesn't need to materialize the predicates bounded to kudu

2020-12-26 Thread Xianqing He (Code Review)
Xianqing He has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16905 )

Change subject: IMPALA-10406: Query with analytic functions doesn't need to 
materialize the predicates bounded to kudu
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16905/2/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/16905/2/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@1130
PS2, Line 1130:   // The predicates that can be bounded to KuduScanNode 
don't need to materialize
> I think this is incorrect, only some conjuncts get pushed into Kudu (see ex
I think it is not necessary to materialize the predicates pushed down to 
KuduScanNode here. Whether these predicates need to be materialized is handled 
in KuduScanNode#init. So if the predicates can be evaluated in Kudu, they 
will not be materialized.



--
To view, visit http://gerrit.cloudera.org:8080/16905

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iba8371eff6ae1bcffd51b44843175c52f2127e46
Gerrit-Change-Number: 16905
Gerrit-PatchSet: 2
Gerrit-Owner: Xianqing He 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Xianqing He 
Gerrit-Comment-Date: Sun, 27 Dec 2020 05:03:14 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10406: Query with analytic functions doesn't need to materialize the predicates bounded to kudu

2020-12-26 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16905 )

Change subject: IMPALA-10406: Query with analytic functions doesn't need to 
materialize the predicates bounded to kudu
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16905/2/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
File fe/src/main/java/org/apache/impala/analysis/SelectStmt.java:

http://gerrit.cloudera.org:8080/#/c/16905/2/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java@1130
PS2, Line 1130:   // The predicates that can be bounded to KuduScanNode 
don't need to materialize
I think this is incorrect, only some conjuncts get pushed into Kudu (see 
extractKuduConjuncts()).

For conjuncts that don't get pushed into Kudu, we still need to materialize the 
slots so we can evaluate them in the Impala KuduScanNode in the backend.

Am I missing something?
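
The distinction raised here can be sketched as follows. This is an
illustration under stated assumptions, not the planner's logic: Conjunct,
its kudu_pushable flag, split_conjuncts() and slots_to_materialize() are
hypothetical, and the sketch only considers slots needed for predicate
evaluation (slots needed by the select list are ignored).

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Conjunct:
    """Hypothetical conjunct: its text, the slots it reads, and pushability."""
    text: str
    slots: FrozenSet[str]
    kudu_pushable: bool  # assumed flag; the real decision is made by the planner


def split_conjuncts(conjuncts: List[Conjunct]) -> Tuple[List[Conjunct], List[Conjunct]]:
    """Separate conjuncts pushed into Kudu from those left for the scan node."""
    pushed = [c for c in conjuncts if c.kudu_pushable]
    residual = [c for c in conjuncts if not c.kudu_pushable]
    return pushed, residual


def slots_to_materialize(conjuncts: List[Conjunct]) -> Set[str]:
    """Slots that must still be materialized to evaluate residual conjuncts."""
    _, residual = split_conjuncts(conjuncts)
    needed: Set[str] = set()
    for c in residual:
        needed |= c.slots
    return needed


conjuncts = [
    Conjunct("id > 10", frozenset({"id"}), True),
    Conjunct("upper(name) = 'A'", frozenset({"name"}), False),
]
print(slots_to_materialize(conjuncts))  # {'name'}
```

Only the slot read by the conjunct that stays outside Kudu remains in the
set; the slot used solely by the pushed-down conjunct does not.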



--
To view, visit http://gerrit.cloudera.org:8080/16905

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iba8371eff6ae1bcffd51b44843175c52f2127e46
Gerrit-Change-Number: 16905
Gerrit-PatchSet: 2
Gerrit-Owner: Xianqing He 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 26 Dec 2020 19:49:31 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10406: Query with analytic functions doesn't need to materialize the predicates bounded to kudu

2020-12-26 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16905 )

Change subject: IMPALA-10406: Query with analytic functions doesn't need to 
materialize the predicates bounded to kudu
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/7908/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16905

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iba8371eff6ae1bcffd51b44843175c52f2127e46
Gerrit-Change-Number: 16905
Gerrit-PatchSet: 2
Gerrit-Owner: Xianqing He 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Sat, 26 Dec 2020 16:15:42 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10406: Query with analytic functions doesn't need to materialize the predicates bounded to kudu

2020-12-26 Thread Xianqing He (Code Review)
Xianqing He has uploaded a new patch set (#2). ( 
http://gerrit.cloudera.org:8080/16905 )

Change subject: IMPALA-10406: Query with analytic functions doesn't need to 
materialize the predicates bounded to kudu
..

IMPALA-10406: Query with analytic functions doesn't need to materialize the 
predicates bounded to kudu

Previously, a query with analytic functions materialized all of the
unassigned conjuncts.
However, predicates that can be evaluated by Kudu do not need to be
materialized.

This optimization can reduce the amount of data to exchange and sort.

Testing:
 - Add planner test in analytic-fns.test

Change-Id: Iba8371eff6ae1bcffd51b44843175c52f2127e46
---
M fe/src/main/java/org/apache/impala/analysis/SelectStmt.java
M testdata/workloads/functional-planner/queries/PlannerTest/analytic-fns.test
2 files changed, 135 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/05/16905/2
--
To view, visit http://gerrit.cloudera.org:8080/16905

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iba8371eff6ae1bcffd51b44843175c52f2127e46
Gerrit-Change-Number: 16905
Gerrit-PatchSet: 2
Gerrit-Owner: Xianqing He 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9922: A better approach to deal with date's sub-second fractions

2020-12-26 Thread fifteencai (Code Review)
fifteencai has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16869 )

Change subject: IMPALA-9922: A better approach to deal with date's sub-second 
fractions
..


Patch Set 3:

> Patch Set 3:
>
> (1 comment)
Thank you so much, I am working on it


--
To view, visit http://gerrit.cloudera.org:8080/16869

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8e870bb8ad8fd14d388f37dfc5175589ecf9a5a7
Gerrit-Change-Number: 16869
Gerrit-PatchSet: 3
Gerrit-Owner: fifteencai 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: fifteencai 
Gerrit-Comment-Date: Sat, 26 Dec 2020 13:40:13 +
Gerrit-HasComments: No