[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-04-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for column types of integer, float/double, date, or decimal for
parquet tables, and to make use of the new stats to discard min/max
filters, in both hash join builders and Parquet scanners, when their
coverage are too close to the actual range defined by the column min
and max.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - Tested with TPCDS 3TB to demonstrate the usefulness of the min
   and max column stats in disabling min/max filters that are not
   useful.
 - core tests.

TODO:
 1. IMPALA-10602: Intersection of multiple min/max filters when
applying to common equi-join columns;
 2. IMPALA-10601: Creating lineitem_orderkey_only table in
tpch_parquet database;
 3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
tables with Parquet data files;
 4. IMPALA-10617: Compute min/max column stats beyond parquet tables.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Reviewed-on: http://gerrit.cloudera.org:8080/17075
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-04-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 30: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 30
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 02 Apr 2021 21:50:16 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-04-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 30:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7044/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 30
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 02 Apr 2021 15:58:54 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-04-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 30: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 30
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 02 Apr 2021 15:58:53 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-04-02 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 29: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 29
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 02 Apr 2021 15:56:39 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-04-02 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 29: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17075/28/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
File 
testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test:

http://gerrit.cloudera.org:8080/#/c/17075/28/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test@98
PS28, Line 98:
> Good catch. Did a unit test with the hudi table. All seems good. I think it
Thanks for trying it out. I'm OK with the current state, HUDI_PARQUET is just 
PARQUET for Impala.



--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 29
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 02 Apr 2021 08:06:02 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-04-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 29:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8493/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 29
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 01 Apr 2021 17:46:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-04-01 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#29). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for column types of integer, float/double, date, or decimal for
parquet tables, and to make use of the new stats to discard min/max
filters, in both hash join builders and Parquet scanners, when their
coverage are too close to the actual range defined by the column min
and max.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - Tested with TPCDS 3TB to demonstrate the usefulness of the min
   and max column stats in disabling min/max filters that are not
   useful.
 - core tests.

TODO:
 1. IMPALA-10602: Intersection of multiple min/max filters when
applying to common equi-join columns;
 2. IMPALA-10601: Creating lineitem_orderkey_only table in
tpch_parquet database;
 3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
tables with Parquet data files;
 4. IMPALA-10617: Compute min/max column stats beyond parquet tables.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-04-01 Thread Qifan Chen (Code Review)
Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 28:

(3 comments)

Rework.

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@260
PS27, Line 260:   }
> > Looks like mixture of files of different format (like Parquet and ORC at
Reworked method hasAtLeastOneParquetPartition() to iterate until a Parquet 
partition is met.

DONE.


http://gerrit.cloudera.org:8080/#/c/17075/28/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/17075/28/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@387
PS28, Line 387:   /*
> Can you remove this method.
Done


http://gerrit.cloudera.org:8080/#/c/17075/28/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
File 
testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test:

http://gerrit.cloudera.org:8080/#/c/17075/28/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test@98
PS28, Line 98: # Create a new hudi parquet table.
> Please note that it creates a plain Parquet table from the data in the Hudi
Good catch. Did a unit test with the hudi table. All seems good. I think it 
probably is okay to leave out the hudi test for now.

Server version: impalad version 4.0.0-SNAPSHOT DEBUG (build 
e45818b616ab4ff81fa512c8257bc8fac594094a)
COMPUTE_COLUMN_MINMAX_STATS set to true
Query: compute stats functional_parquet.hudi_non_partitioned
+--+
| summary  |
+--+
| Updated 1 partition(s) and 15 column(s). |
+--+
Fetched 1 row(s) in 0.53s
[12:57:17 qchen@qifan-10229: IMPALA-10494_making_use_of_minmax_column_stats] 
sql dml.showstats.hudi
Starting Impala Shell with no authentication using Python 2.7.16
Warning: live_progress only applies to interactive shell sessions, and is being 
skipped for now.
Opened TCP connection to localhost:21000
Connected to localhost:21000
Server version: impalad version 4.0.0-SNAPSHOT DEBUG (build 
e45818b616ab4ff81fa512c8257bc8fac594094a)
SHOW_COLUMN_MINMAX_STATS set to true
Query: show column stats functional_parquet.hudi_non_partitioned
++---+--++--+---++-+--++
| Column | Type  | #Distinct 
Values | #Nulls | Max Size | Avg Size  | #Trues | #Falses | Min 
 | Max|
++---+--++--+---++-+--++
| _hoodie_commit_time| STRING| 2
| 0  | 14   | 14| -1 | -1  | -1 
  | -1 |
| _hoodie_commit_seqno   | STRING| 97   
| 0  | 20   | 20| -1 | -1  | -1 
  | -1 |
| _hoodie_record_key | STRING| 99   
| 0  | 36   | 36| -1 | -1  | -1 
  | -1 |
| _hoodie_partition_path | STRING| 3
| 0  | 25   | 25| -1 | -1  | -1 
  | -1 |
| _hoodie_file_name  | STRING| 6
| 0  | 71   | 70.6830517578 | -1 | -1  | -1 
  | -1 |
| _hoodie_is_deleted | BOOLEAN   | 2
| 0  | 1| 1 | 0  | 100 | -1 
  | -1 |
| _row_key   | STRING| 99   
| 0  | 36   | 36| -1 | -1  | -1 
  | -1 |
| begin_lat  | DOUBLE| 100  
| 0  | 8| 8 | -1 | -1  | 
0.013803214965246391 | 0.9973157077943435 |
| begin_lon  | DOUBLE| 99   
| 0  | 8| 8 | -1 | -1 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-04-01 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 28: Code-Review+1

(1 comment)

LGTM, but had a comment about testing Hudi tables

http://gerrit.cloudera.org:8080/#/c/17075/28/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
File 
testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test:

http://gerrit.cloudera.org:8080/#/c/17075/28/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test@98
PS28, Line 98: # Create a new hudi parquet table.
Please note that it creates a plain Parquet table from the data in the Hudi 
table. Unfortunately Impala cannot write Hudi tables.

I think we have the following options here:
* Add a hudi table to the repo only for this test. But it needs some work to 
figure out how to do that
* Use functional_parquet.hudi_non_partitioned and hope it doesn't break other 
tests. But we cannot guarantee it won't break future tests.
* Create a Jira ticket to add Hudi testing later



--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 28
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 01 Apr 2021 11:57:22 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-31 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 28: Code-Review+1

(2 comments)

Just a nit and a comment. Should be able to +2 after that.

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@260
PS27, Line 260:   }
> Looks like mixture of files of different format (like Parquet and ORC at the 
> same time) is not allowed.
This may not be accurate.  In the HdfsScanNode's computeScanRangeLocation(), we 
examine the file format for a partition and just record it but just set a flag 
if they are not all parquet format. It does not assert or return error.
Here's the snippet in HdfsScanNode.java:
  fileFormats_.add(partition.getFileFormat());
  if (!isParquetBased(partition.getFileFormat())) {
allParquet = false;
  }
However, for statistics, as I mentioned before (and you  seem in agreement) 
that having at least one parquet partition is sufficient to trigger the min-max 
filter checks since it does not affect correctness of results.


http://gerrit.cloudera.org:8080/#/c/17075/28/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/17075/28/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@387
PS28, Line 387:   /*
Can you remove this method.



--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 28
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 31 Mar 2021 21:39:27 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 28:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8460/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 28
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 29 Mar 2021 18:42:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-29 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#28). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for column types of integer, float/double, date, or decimal for
parquet tables, and to make use of the new stats to discard min/max
filters, in both hash join builders and Parquet scanners, when their
coverage are too close to the actual range defined by the column min
and max.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - Tested with TPCDS 3TB to demonstrate the usefulness of the min
   and max column stats in disabling min/max filters that are not
   useful.
 - core tests.

TODO:
 1. IMPALA-10602: Intersection of multiple min/max filters when
applying to common equi-join columns;
 2. IMPALA-10601: Creating lineitem_orderkey_only table in
tpch_parquet database;
 3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
tables with Parquet data files.
 4. IMPALA-10617: Compute min/max column stats beyond the parquet tables

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-29 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 27: Code-Review+1


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 27
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 29 Mar 2021 12:52:09 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-25 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 27:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG@19
PS19, Line 19: show_column_minmax_stats
> I guess we have quite a bunch of .test files that are executed against all
Ok, yeah it makes sense that the same .test file is used for different 
formats..so will involve updating several of them (just for reference, I 
checked under testdata:
   testdata/workloads$ grep -r -i "show column stats" | cut -d ":" -f 1 | uniq 
| wc -l
15
It shows 15 files).

On extending the min/max to other formats, we could probably consider in the 
future .. any such stats should ideally be leveraged by either the optimizer 
(for selectivity estimate) or runtime rather than be only for informational 
purpose.

My other thought on this is that showing min/max stats should ideally be 
controlled through a syntax improvement.. e.g SHOW COLUMN STATS WITH MINMAX 
  rather than a query option.
But in the interest of moving this patch forward, I am ok with deferring such 
work.



--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 27
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 25 Mar 2021 17:37:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-25 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 19:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG@19
PS19, Line 19: show_column_minmax_stats
> Did we arrive at a consensus on this ?  Adding new query options means we h
I guess we have quite a bunch of .test files that are executed against all 
kinds of file formats, and the tests require to have the same result data in 
each case.

The other option is to calculate min/max for every file format. Since we only 
do it for numeric values I don't think it would add too much overhead. We 
couldn't use them during query execution (only for Parquet and later ORC?), but 
might be still useful for a user if they want to see some basic stats of a 
table.


http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
File fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java:

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java@279
PS27, Line 279:   public boolean isParquetBased() {
> nit: There's an identical method in HdfsScanNode.java which is currently pr
Yeah, HUDI_PARQUET means Hudi table format + Parquet data files, therefore the 
min/max filters will work fine on them.



--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 19
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 25 Mar 2021 16:26:07 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-25 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 27:

(3 comments)

Was getting ready to +2 this but just had couple of additional questions and a 
suggestion.

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG@19
PS19, Line 19: show_column_minmax_stats
> On paper, yes we should be able to show the min/max. On the other hand, the
Did we arrive at a consensus on this ?  Adding new query options means we have 
to document them.  It seems option 2 could be deprecated in the future as soon 
as tests are updated so I would vote for not documenting it for now. I am also 
not clear why many tests have to be updated.. if non-parquet formats don't have 
this stats stored in HMS, why would it affect the display ? We could just skip 
the last 2 columns.


http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@260
PS27, Line 260:   for (FeFsPartition partition : allPartitions) {
it's unfortunate that we have to iterate through all partitions (could be tens 
of thousands) to check this..and I am not sure if at this point the how many 
rpcs the loadAllPartitions() will make to the catalog server. In a vast 
majority of cases all partitions will be of the same format.  My question is 
..suppose you find at least 1 parquet format and return true for computeMinMax 
..doesn't it just mean that whatever min/max stats is computed will be off 
(similar to stale stats that we discussed earlier) .. so in theory it will 
reduce the effectiveness of the filter but not affect correctness.
I am thinking of use cases with lots of partitions where users have COMPUTE 
STATS in their ETL scripts and suppose they enable compute_min_max_stats, this 
will slow down the compute stats noticeably.


http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
File fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java:

http://gerrit.cloudera.org:8080/#/c/17075/27/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java@279
PS27, Line 279:   public boolean isParquetBased() {
nit: There's an identical method in HdfsScanNode.java which is currently 
private (see below) but it would be good to either have that method call this 
one or have a public static isParquetBased(HdfsFileFormat) defined somewhere.

https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java#L387

Also, I suppose HUDI_PARQUET min/max stats should work with this patch ?  I 
don't think a test has been added for that. The fact that the other method in 
HdfsScanNode treats both somewhat equivalent leads me to think it should work.



--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 27
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 25 Mar 2021 06:14:54 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-24 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 27:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8432/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 27
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 24 Mar 2021 14:34:09 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-24 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#27). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for column types of integer, float/double, date, or decimal for
parquet tables, and to make use of the new stats to discard min/max
filters, in both hash join builders and Parquet scanners, when their
coverage are too close to the actual range defined by the column min
and max.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - Tested with TPCDS 3TB to demonstrate the usefulness of the min
   and max column stats in disabling min/max filters that are not
   useful.
 - core tests.

TODO:
 1. IMPALA-10602: Intersection of multiple min/max filters when
applying to common equi-join columns;
 2. IMPALA-10601: Creating lineitem_orderkey_only table in
tpch_parquet database;
 3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
tables with Parquet data files.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 26:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8428/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 26
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Wed, 24 Mar 2021 02:44:18 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-23 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#26). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for column types of integer, float/double, date, or decimal for
parquet tables, and to make use of the new stats to discard min/max
filters, in both hash join builders and Parquet scanners, when their
coverage are too close to the actual range defined by the column min
and max.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - Tested with TPCDS 3TB to demonstrate the usefulness of the min
   and max column stats in disabling min/max filters that are not
   useful.
 - core tests.

TODO:
 1. IMPALA-10602: Intersection of multiple min/max filters when
applying to common equi-join columns;
 2. IMPALA-10601: Creating lineitem_orderkey_only table in
tpch_parquet database;
 3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
tables with Parquet data files.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 25:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8425/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 25
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 23 Mar 2021 19:46:26 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 25:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17075/25/tests/query_test/test_runtime_filters.py
File tests/query_test/test_runtime_filters.py:

http://gerrit.cloudera.org:8080/#/c/17075/25/tests/query_test/test_runtime_filters.py@287
PS25, Line 287: u
flake8: F821 undefined name 'unique_database'



--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 25
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 23 Mar 2021 19:26:55 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-23 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#25). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for column types of integer, float/double, date, or decimal for
parquet tables, and to make use of the new stats to discard min/max
filters, in both hash join builders and Parquet scanners, when their
coverage are too close to the actual range defined by the column min
and max.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - Tests with TPCDS 3TB which demonstrate the usefulness of the min
   and max column stats in disabling min/max filters that are not
   useful.
 - core tests.

TODO:
 1. IMPALA-10602: Intersection of multiple min/max filters when
applying to common equi-join columns;
 2. IMPALA-10601: Creating lineitem_orderkey_only table in
tpch_parquet database;
 3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
tables with Parquet data files.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-23 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 24:

(9 comments)

lgtm, mostly found nits.

http://gerrit.cloudera.org:8080/#/c/17075/24//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17075/24//COMMIT_MSG@74
PS24, Line 74:  1. Enable the feature for Iceberg tables with Parquet data 
files.
Do you plan to resolve this TODO in this patch, or in a follow-up Jira? If the 
latter, please create a new Jira ticket and refer to the ticket id here.


http://gerrit.cloudera.org:8080/#/c/17075/24/be/src/exec/filter-context.cc
File be/src/exec/filter-context.cc:

http://gerrit.cloudera.org:8080/#/c/17075/24/be/src/exec/filter-context.cc@432
PS24, Line 432: // in the filter is too close to the column min/max.
nit: please extend comment with in what cases we return true/false.


http://gerrit.cloudera.org:8080/#/c/17075/24/be/src/exec/filter-context.cc@433
PS24, Line 433: OverlapWithColumnLowAndHighValue
nit: this name doesn't really tell what this function does. Maybe 
'ShouldRejectFilterBasedOnColumnStats()'?


http://gerrit.cloudera.org:8080/#/c/17075/24/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17075/24/be/src/exec/parquet/hdfs-parquet-scanner.cc@122
PS24, Line 122: NumColumnStatsRejectedRowGroups
nit: I think it's easy to misinterpret the name of this counter. It sounds like 
it counts the number of row groups rejected by column stats. Maybe a better 
name would be "NumUnusefulFiltersForRowGroups"?


http://gerrit.cloudera.org:8080/#/c/17075/24/be/src/service/hs2-util.cc
File be/src/service/hs2-util.cc:

http://gerrit.cloudera.org:8080/#/c/17075/24/be/src/service/hs2-util.cc@631
PS24, Line 631: l
nit: typo


http://gerrit.cloudera.org:8080/#/c/17075/24/be/src/service/hs2-util.cc@632
PS24, Line 632: must
nit: must be?


http://gerrit.cloudera.org:8080/#/c/17075/24/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
File 
testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test:

http://gerrit.cloudera.org:8080/#/c/17075/24/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test@4
PS24, Line 4: set compute_column_minmax_stats = true;
: compute stats functional_parquet.alltypes;
I'm not sure if it's a good idea to modify these tables in tests, even if it's 
just column stats. Other tests might depend on the stats being set/unset. 
Probably we should just copy these tables into a private unique database. Also, 
we could use alltypestiny instead of alltypes.


http://gerrit.cloudera.org:8080/#/c/17075/24/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
File 
testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17075/24/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test@8
PS24, Line 8: create database if not exists unique_database;
You can change TestOverlapMinMaxFilters:: test_all_runtime_filters to use a 
unique database by just adding a parameter to your test method then pass it to 
run_test_case(). Like you already do it in this patch set for 
test_hbase_compute_stats (though the method name shouldn't have hbase in it).


http://gerrit.cloudera.org:8080/#/c/17075/24/tests/metadata/test_compute_stats.py
File tests/metadata/test_compute_stats.py:

http://gerrit.cloudera.org:8080/#/c/17075/24/tests/metadata/test_compute_stats.py@446
PS24, Line 446: _hbase_
this test is not related to HBase.



--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 24
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 23 Mar 2021 15:04:22 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 24:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8419/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 24
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 22 Mar 2021 21:55:11 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-22 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#24). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for column types of integer, float/double, or date for parquet
tables, and to make use of the new stats to discard min/max filters, in
both hash join builders and Parquet scanners, whose coverage are too
close to the actual range defined by the column min and max.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - core tests.

TODO:
 1. Enable the feature for Iceberg tables with Parquet data files.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-21 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 23: Code-Review+1

(4 comments)

This mostly looks good to me, so doing a +1. One pending question is about the 
DATE type..can you clarify?

http://gerrit.cloudera.org:8080/#/c/17075/21//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17075/21//COMMIT_MSG@12
PS21, Line 12: both hash join builders and Parque
> Currently at the scanner level, the overlap is checked against the column i
Ok, so the worse that would happen if the stats are stale is that instead of 
marking it as alwaysTrue, we would keep the min-max filter and since the 
min-max filter is created based on the actual data values, it should be ok.
Agree about the TPC-DS use case for the scans since most of the columns have 
either randomly or uniformly distributed values across all row groups.


http://gerrit.cloudera.org:8080/#/c/17075/23/be/src/exec/filter-context.cc
File be/src/exec/filter-context.cc:

http://gerrit.cloudera.org:8080/#/c/17075/23/be/src/exec/filter-context.cc@477
PS23, Line 477: case PrimitiveType::TYPE_DATE:
I thought you added DATE in patch set 23 ?


http://gerrit.cloudera.org:8080/#/c/17075/19/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
File 
testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17075/19/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test@10
PS19, Line 10: CREATE TABLE unique_database.lineitem_orderkey_only(l_orderkey 
bigint)
> Sounds like a good idea. Due to the complexity of bin/generate-schema-state
I am ok with deferring this. Pls create a JIRA ticket.


http://gerrit.cloudera.org:8080/#/c/17075/21/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
File 
testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17075/21/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test@221
PS21, Line 221:  QUERY
> Another approach would be to join D1 and D2 first which will produce one 
> filter.
The join order decisions are made during the logical planning phase based on 
costing, so we would not change that for the min-max filters.  But yeah, we 
should create an enhancement JIRA for creating the intersection of the 2 or 
more min-max intervals. It is also easier to do compared to the bloom filter 
aggregation that is supported today that happens at the coordinator node.



--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 23
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 22 Mar 2021 02:30:13 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-19 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 23:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8411/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 23
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 19 Mar 2021 23:04:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-19 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#23). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for column types of integer, float/double, or date for parquet
tables, and to make use of the new stats to discard min/max filters, in
both hash join builders and Parquet scanners, whose coverage are too
close to the actual range defined by the column min and max.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - core tests.

TODO:
 1. Test compute stats for decimal columns;
 2. Enable the feature for Iceberg tables with Parquet data files.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-18 Thread Qifan Chen (Code Review)
Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 22:

Aman asked whether min/max filters are applied in the context of out-joins. The 
answer is no.  Please refer to TPCDS q49.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 22
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 18 Mar 2021 20:27:05 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 22:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8394/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 22
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 18 Mar 2021 20:17:01 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-18 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#22). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for a column of type integers, float or double for parquet tables,
and to make use of the new stats to discard min/max filters, in both hash
join builders and Parquet scanners, whose coverage are too close to the
actual range defined by the column min and max.

The computation and dislay of the new column min/max stats are done
for Parquet tables only and can be controlled by two new Boolean query
options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests for integers, float and double
   column data types in compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;
 2. Enable the feature for Iceberg tables with Parquet data files.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-15 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 21:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17075/21//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17075/21//COMMIT_MSG@12
PS21, Line 12: join builders and Parquet scanners
Is it fair to say that the main value proposition of the column  min/max 
statistics is in the join builder ?  The Parquet scanners already have access 
to the row group's min/max stats per column, so it seems to me the stats coming 
from HMS will not add additional value there. But for the HJ builder, yes it 
helps by figuring out whether after the build phase let's say you have the 
range [10, 50] and the min/max stats fetched from HMS are [60, 100] then we can 
quickly say that the runtime min/max filter will exclude all row groups.

But what happens if the stats are out-of-date ? Since these stats are getting 
uses not just for ACID tables but for external tables as well, the stats are 
not bound to the table's valid write id (applicable only for ACID tables). This 
can lead to incorrect overlap calculation. Any thoughts ?


http://gerrit.cloudera.org:8080/#/c/17075/21/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
File 
testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17075/21/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test@221
PS21, Line 221:  QUERY
For the tests, it would be good to add a 3 table join case as well. My 
understanding is that if you have a fact table F and 2 dimension tables D1, D2 
and suppose F is joined to D1, D2 as follows:
 F. a1 = D1.a1 AND F.a1 = D2.a1  then suppose the first join's  range is  [10, 
50] and the second join's range is [20, 60]  then the filter seen at scan of F  
will be [20, 50].  Are such scenario already tested elsewhere ? If so, feel 
free to point me to those and mark this resolved.



--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 21
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 16 Mar 2021 03:07:30 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-15 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 21:

Thinking about min-max filters in the context of outer joins, I would expect 
they are _not_ pushed to the non-null producing side of an outer join (we have 
similar restriction for the runtime bloom filters). Is that handled somewhere 
(maybe in another patch) ?


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 21
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 15 Mar 2021 18:40:01 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-15 Thread Aman Sinha (Code Review)
Aman Sinha has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 21:

(5 comments)

I haven't analyzed how exactly the overlap of min-max filters works..so will go 
through the design doc and maybe the original commit.  In the meantime, sending 
some initial comments.

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/catalog-op-executor.cc
File be/src/exec/catalog-op-executor.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/catalog-op-executor.cc@316
PS19, Line 316: or non-string
  :   // columns.
Update this since min/max stats are not getting computed for Decimal types.  
Also, are we sure that allowing float/double min-max filters is ok ? These are 
not precise data types.


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/filter-context.cc
File be/src/exec/filter-context.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/filter-context.cc@474
PS19, Line 474: case PrimitiveType::TYPE_TIMESTAMP:
The commit message says the min-max column stats is used for integer, float and 
double so I was not expecting to see timestamp and date types also considered 
here.  Shouldn't they just to the default case and return false ?


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/incr-stats-util.cc
File be/src/exec/incr-stats-util.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/incr-stats-util.cc@260
PS19, Line 260: apache::hive::service::cli::thrift::TColumnValue
It would be good to create a typedef for the hive version of TColumnValue.  
Also, the package name looks different from what I see in the source code here: 
https://github.com/apache/hive/blob/master/service-rpc/src/gen/thrift/gen-javabean/org/apache/hive/service/rpc/thrift/TColumnValue.java#L7


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/partitioned-hash-join-builder.cc@957
PS19, Line 957: VLOG(3) << "The filter is set to always true.";
nit: Would be useful to print that this was a min-max filter and probably the 
filter id.


http://gerrit.cloudera.org:8080/#/c/17075/19/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
File 
testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17075/19/testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test@10
PS19, Line 10: CREATE TABLE unique_database.lineitem_orderkey_only(l_orderkey 
bigint)
Creating large tables during a test run adds unacceptable latency to the  
testing time.   Can the db and table creation steps not be moved into 
testdata/datasets/tpch/  as a schema template for min-max filters ?



--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 21
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Aman Sinha 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 15 Mar 2021 18:14:25 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 21:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8371/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 21
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 15 Mar 2021 17:24:30 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-15 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#21). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for a column of type integers, float or double for parquet tables,
and to make use of the new stats to discard min/max filters, in both hash
join builders and Parquet scanners, whose coverage are too close to the
actual range defined by the column min and max.

The computation and dislay of the new column min/max stats are done
for Parquet tables only and can be controlled by two new Boolean query
options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests for integers, float and double
   column data types in compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;
 2. Enable the feature for Iceberg tables with Parquet data files.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-15 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 20:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8369/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 20
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Mon, 15 Mar 2021 15:04:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-15 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#20). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for a column of type integers, float or double for parquet tables,
and to make use of the new stats to discard min/max filters, in both hash
join builders and Parquet scanners, whose coverage are too close to the
actual range defined by the column min and max.

The computation and dislay of the new column min/max stats are done
for Parquet tables only and can be controlled by two new Boolean query
options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests for integers, float and double
   column data types in compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-12 Thread Zoltan Borok-Nagy (Code Review)
Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 19:

(8 comments)

Did a quick walkthrough, will look into it in detail next week.

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17075/19//COMMIT_MSG@19
PS19, Line 19: show_column_minmax_stats
Do we need this query option? I mean if we have min/max stats then we'd 
probably want to show them.


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/filter-context.cc
File be/src/exec/filter-context.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/filter-context.cc@447
PS19, Line 447: (
nit: parentheses are not needed as the '.' member access takes precedence over 
the '&' adress-of operator,


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/filter-context.cc@492
PS19, Line 492: (
nit: parentheses not needed


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/hdfs-scanner.h@348
PS19, Line 348: uint8_t enabled_for_rowgroup;
Why do we need this flag? If enabled_for_rowgroup is false, then the min/max 
filter is completely turned off, right? In that case we shouldn't even send 
them to the scanner, or am I missing something?


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/parquet/hdfs-parquet-scanner.cc@662
PS19, Line 662: }
nit: EvaluateOverlapForRowGroup() is already quite long, maybe this code could 
go into a separate function.


http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

http://gerrit.cloudera.org:8080/#/c/17075/19/be/src/exec/partitioned-hash-join-builder.cc@950
PS19, Line 950:
nit: indentation


http://gerrit.cloudera.org:8080/#/c/17075/19/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
File fe/src/main/java/org/apache/impala/catalog/ColumnStats.java:

http://gerrit.cloudera.org:8080/#/c/17075/19/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@50
PS19, Line 50: LOG
seems unused


http://gerrit.cloudera.org:8080/#/c/17075/19/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
File fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java:

http://gerrit.cloudera.org:8080/#/c/17075/19/fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java@278
PS19, Line 278:*/
It would be good to handle Iceberg tables that use Parquet data files.



--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 19
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Fri, 12 Mar 2021 17:02:09 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 19:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8335/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 19
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 11 Mar 2021 01:20:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-10 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#19). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for a column of type integers, float or double for parquet tables,
and to make use of the new stats to discard min/max filters, in both hash
join builders and Parquet scanners, whose coverage are too close to the
actual range defined by the column min and max.

The computation and dislay of the new column min/max stats are done
for Parquet tables only and can be controlled by two new Boolean query
options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests for integers, float and double
   column data types in compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-09 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 18:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8322/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 18
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 09 Mar 2021 19:09:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-09 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#18). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integers, float or double
for parquet tables, and to make use of the new stats to discard
the min/max filters whose coverage are too close to the actual
range in both hash join builders and Parquet scanners.

The computation and dislay of the new column min/max stats are done
for Parquet tables only and can be controlled by two new Boolean query
options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the min
and max in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests for integers, float and double
   column data types in compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner.;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well.
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-09 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 17:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8318/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 17
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 09 Mar 2021 16:55:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-09 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#17). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integers, float or double
for parquet tables, and to make use of the new stats to discard
the min/max filters whose coverage are too close to the actual
range in both hash join builders and Parquet scanners.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;
+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the min
and max in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests for integers, float and double
   column data types in compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well.
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 16:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/8311/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 16
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Comment-Date: Mon, 08 Mar 2021 21:20:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 15:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/8310/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 15
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Comment-Date: Mon, 08 Mar 2021 21:12:51 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-08 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#16). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integers, float or double
for parquet tables, and to make use of the new stats to discard
the min/max filters whose coverage are too close to the actual
range.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

When enabled, two new columns 'Min' and 'Max' are added in the output
of the show column command as shown below.

set show_column_minmax_stats=true;
show column stats tpcds_parquet.store_sales;
+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the min
and max in the filter is no longer populated.

Testing:
 - Added new compute/show stats tests for integers, float and double
   column data types in compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well.
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-08 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#15). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integers, float or double
for parquet tables, and to make use of the new stats to discard
the min/max filters whose coverage are too close to the actual
range.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

When enabled, two new columns 'Min' and 'Max' are added in the output
of the show column command as shown below.

set show_column_minmax_stats=true;
show column stats tpcds_parquet.store_sales;
+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the min
and max in the filter is no longer populated.

Testing:
 - Added new compute/show stats tests for integers, float and double
   column data types in compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well.
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/runtime/runtime-filter-bank.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter-test.cc
M be/src/util/min-max-filter.cc
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 14:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8296/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 14
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Comment-Date: Wed, 03 Mar 2021 01:10:58 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 13:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8295/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 13
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Comment-Date: Wed, 03 Mar 2021 01:02:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-02 Thread Qifan Chen (Code Review)
Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 14:

Fix incorrectly spelled query option name in overlap_min_max_filters.test.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 14
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Comment-Date: Wed, 03 Mar 2021 00:51:21 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-02 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#14). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integers, float or double
for parquet tables, and to make use of the new stats to discard
the min/max filters whose coverage are too close to the actual
range.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

When enabled, two new columns 'Min' and 'Max' are added in the output
of the show column command as shown below.

set show_column_minmax_stats=true;
show column stats tpcds_parquet.store_sales;
+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

Testing:
 - Added TestLowAndHighValueShort and TestLowAndHighValueInt to
   IncrStatsUtilTest;
 - Add new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters;
 - Tested compute/show stats for integers, float and double column data
   types;
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;
 2. Add logic to disable min/max filters inside HJ builder via
the column stats.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-02 Thread Qifan Chen (Code Review)
Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 13:

Added the following:
 1. New test overlap_min_max_filters.test
 2. Query option show_column_minmax_stats to facilitate shield existing column 
stats from format change
 3. A new counter NumColumnStatsRejectedRowGroups.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 13
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Comment-Date: Wed, 03 Mar 2021 00:45:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17075/13/tests/metadata/test_compute_stats.py
File tests/metadata/test_compute_stats.py:

http://gerrit.cloudera.org:8080/#/c/17075/13/tests/metadata/test_compute_stats.py@430
PS13, Line 430: class TestParquetComputeColumnMinMax(ImpalaTestSuite):
flake8: E302 expected 2 blank lines, found 1



--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 13
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Comment-Date: Wed, 03 Mar 2021 00:43:21 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-02 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#13). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integers, float or double
for parquet tables, and to make use of the new stats to discard
the min/max filters whose coverage are too close to the actual
range.

The computation and dislay of the new column min/max stats can be
controlled by two new Boolean query options (default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

When enabled, two new columns 'Min' and 'Max' are added in the output
of the show column command as shown below.

set show_column_minmax_stats=true;
show column stats tpcds_parquet.store_sales;
+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

Testing:
 - Added TestLowAndHighValueShort and TestLowAndHighValueInt to
   IncrStatsUtilTest;
 - Add new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters;
 - Tested compute/show stats for integers, float and double column data
   types;
 - core tests.

TODO:
 1. Test compute stats for timestamp and date columns;
 2. Add logic to disable min/max filters inside HJ builder via
the column stats.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/Frontend.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/analysis/ShowStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
A 

[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 12:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8279/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 12
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Comment-Date: Tue, 02 Mar 2021 03:58:04 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-01 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#12). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integers, float or double
for parquet tables, and to make use of the new stats to discard
the min/max filters whose coverage are too close to the actual
range.

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

Two new columns 'Min' and 'Max' are added in the output of the
show column stats command as shown below.

show column stats tpcds_parquet.store_sales
+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Testing:
 - Added TestLowAndHighValueShort and TestLowAndHighValueInt to
   IncrStatsUtilTest

TODO:
 1. Test compute stats for timestamp and date columns;
 2. Test filters being disabled at the scan node;
 3. Add logic to disable min/max filters inside HJ builder via
the column stats.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
M 
testdata/workloads/functional-query/queries/QueryTest/overlap_min_max_filters.test
26 files changed, 847 insertions(+), 68 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/12
--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 12
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 11:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8270/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 11
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Comment-Date: Mon, 01 Mar 2021 21:33:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-01 Thread Qifan Chen (Code Review)
Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 11:

Also add the query option compute_column_min_max_stats. Default to 'false'.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 11
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Comment-Date: Mon, 01 Mar 2021 21:14:35 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-03-01 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#11). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integer and double
for parquet tables, and to make use of the new stats to
discard min/max filters whose coverage are too close to the actual
range.

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

Two new columns 'Min' and 'Max' are added in the output of the
show column stats command as shown below.

show column stats tpcds_parquet.store_sales
+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Testing:
 - Added TestLowAndHighValueShort and TestLowAndHighValueInt to
   IncrStatsUtilTest

TODO:
 1. Test compute stats for timestamp and date columns;
 2. Test filters being disabled at the scan node;
 3. Add logic to disable min/max filters inside HJ builder via
the column stats.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/ImpalaInternalService.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
25 files changed, 827 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/11
--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 11
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-02-27 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#10). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integer, double, timestamp
or date for parquet tables, and to make use of the new stats to
discard min/max filters whose coverage are too close to the actual
range.

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in coordinator.

Two new columns 'Min' and 'Max' are added in the output of the
show column stats command as shown below.

show column stats tpcds_parquet.store_sales
+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Testing:
 - Added TestLowAndHighValueShort and TestLowAndHighValueInt to
   IncrStatsUtilTest

TODO:
 1. Test compute stats for float, timestamp and date columns;
 2. Test filters being disabled at the scan node;
 3. Add logic to disable min/max filters inside HJ builder via
the column stats;

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/scan-node.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M be/src/util/min-max-filter.h
M common/thrift/CatalogObjects.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
22 files changed, 826 insertions(+), 115 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/10
--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 10
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-02-25 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..


Patch Set 9:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8243/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 9
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen 
Gerrit-Comment-Date: Thu, 25 Feb 2021 22:27:39 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10494: Making use of the min/max column stats to improve min/max filters

2021-02-25 Thread Qifan Chen (Code Review)
Qifan Chen has uploaded a new patch set (#9). ( 
http://gerrit.cloudera.org:8080/17075 )

Change subject: IMPALA-10494: Making use of the min/max column stats to improve 
min/max filters
..

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and
the maximal value for a column of type integer, double, timestamp
or date for parquet tables, and to make use of the new stats to
discard min/max filters whose coverage are too close to the actual
range.

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed when each
partition is processed in coordinator.

Two new columns 'Min' and 'Max' are added in the output of the
show column stats command as shown below.

show column stats tpcds_parquet.store_sales
+---+--+-...---+-+-+
| Column| Type |   #Falses | Min | Max |
+---+--+-...---+-+-+
| ss_sold_time_sk   | INT  |   -1  | 28800   | 75599   |
| ss_item_sk| BIGINT   |   -1  | 1   | 18000   |
| ss_customer_sk| INT  |   -1  | 1   | 10  |
| ss_cdemo_sk   | INT  |   -1  | 15  | 1920797 |
| ss_hdemo_sk   | INT  |   -1  | 1   | 7200|
| ss_addr_sk| INT  |   -1  | 1   | 5   |
| ss_store_sk   | INT  |   -1  | 1   | 10  |
| ss_promo_sk   | INT  |   -1  | 1   | 300 |
| ss_ticket_number  | BIGINT   |   -1  | 1   | 24  |
| ss_quantity   | INT  |   -1  | 1   | 100 |
| ss_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_sales_price| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_list_price | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_ext_tax| DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_coupon_amt | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_net_profit | DECIMAL(7,2) |   -1  | -1  | -1  |
| ss_sold_date_sk   | INT  |   -1  | 2450816 | 2452642 |
+---+--+-...---+-+-+

Testing:
 - Added TestLowAndHighValueShort and TestLowAndHighValueInt to
   IncrStatsUtilTest

TODO:
 1. Test compute stats for float, timestamp and date columns;
 2. Test filters being disabled at the scan node;
 3. Add logic to disable min/max filters inside HJ builder via
the column stats;

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
---
M be/src/exec/catalog-op-executor.cc
M be/src/exec/filter-context.cc
M be/src/exec/filter-context.h
M be/src/exec/hdfs-scanner.h
M be/src/exec/incr-stats-util-test.cc
M be/src/exec/incr-stats-util.cc
M be/src/exec/incr-stats-util.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/scan-node.cc
M be/src/service/hs2-util.cc
M be/src/service/hs2-util.h
M common/thrift/CatalogObjects.thrift
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java
M fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/java/org/apache/impala/util/MetaStoreUtil.java
21 files changed, 816 insertions(+), 114 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/75/17075/9
--
To view, visit http://gerrit.cloudera.org:8080/17075
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Gerrit-Change-Number: 17075
Gerrit-PatchSet: 9
Gerrit-Owner: Qifan Chen 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Qifan Chen