[ https://issues.apache.org/jira/browse/IMPALA-10494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314370#comment-17314370 ]
ASF subversion and git services commented on IMPALA-10494:
----------------------------------------------------------

Commit 1231208da7104c832c13f272d1e5b8f554d29337 in impala's branch refs/heads/master from Qifan Chen
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1231208 ]

IMPALA-10494: Making use of the min/max column stats to improve min/max filters

This patch adds the functionality to compute the minimal and the maximal
value for columns of type integer, float/double, date, or decimal in
Parquet tables, and to use the new stats to discard min/max filters, in
both hash join builders and Parquet scanners, when their coverage is too
close to the actual range defined by the column min and max.

The computation and display of the new column min/max stats are controlled
by two new Boolean query options (both default to false):

 1. compute_column_minmax_stats
 2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         | #Falses   | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          | -1        | 28800   | 75599   |
| ss_item_sk            | BIGINT       | -1        | 1       | 18000   |
| ss_customer_sk        | INT          | -1        | 1       | 100000  |
| ss_cdemo_sk           | INT          | -1        | 15      | 1920797 |
| ss_hdemo_sk           | INT          | -1        | 1       | 7200    |
| ss_addr_sk            | INT          | -1        | 1       | 50000   |
| ss_store_sk           | INT          | -1        | 1       | 10      |
| ss_promo_sk           | INT          | -1        | 1       | 300     |
| ss_ticket_number      | BIGINT       | -1        | 1       | 240000  |
| ss_quantity           | INT          | -1        | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) | -1        | -1      | -1      |
| ss_sold_date_sk       | INT          | -1        | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in the coordinator.

The min/max filters, in C++ class or protobuf form, are augmented to
handle the always-true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate that method Or(),
   ToProtobuf() and the constructor handle the always-true flag well;
 - Tested with TPCDS 3TB to demonstrate the usefulness of the min
   and max column stats in disabling min/max filters that are not
   useful;
 - Core tests.

TODO:
 1. IMPALA-10602: Intersection of multiple min/max filters when
    applying to common equi-join columns;
 2. IMPALA-10601: Creating lineitem_orderkey_only table in
    tpch_parquet database;
 3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
    tables with Parquet data files;
 4. IMPALA-10617: Compute min/max column stats beyond parquet tables.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Reviewed-on: http://gerrit.cloudera.org:8080/17075
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> Making use of the min/max column stats to improve min/max filters
> -----------------------------------------------------------------
>
>                 Key: IMPALA-10494
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10494
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Qifan Chen
>            Priority: Major
>
> HMS (hive metastore) API offers means to store the minimal and maximal value
> per column
> (https://hive.apache.org/javadocs/r3.0.0/api/org/apache/hadoop/hive/metastore/api/ColumnStatisticsData.html).
> For example, such stats for an integer column can be captured via a
> LongColumnStatsData object
> (https://hive.apache.org/javadocs/r3.0.0/api/org/apache/hadoop/hive/metastore/api/LongColumnStatsData.html).
>
> It is desirable to use the min and max stats per column to help the formation
> of useful min/max filters that can help reduce the data scanned for Parquet
> tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
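
The commit message says a min/max filter is discarded when its coverage is too close to the actual range given by the column min and max. A minimal Python sketch of that idea follows. It is an illustration only, not the Impala implementation: the names Range, coverage and should_discard and the 0.9 threshold are hypothetical, and the real patch is written in C++ and may define coverage differently. The column range is taken from the ss_sold_date_sk row in the stats output above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Closed interval [lo, hi]. Hypothetical helper, not an Impala type."""
    lo: float
    hi: float


def coverage(filter_range: Range, column_range: Range) -> float:
    """Fraction of the column's [min, max] range that the filter overlaps."""
    lo = max(filter_range.lo, column_range.lo)
    hi = min(filter_range.hi, column_range.hi)
    if hi < lo:
        # The filter and the column range are disjoint.
        return 0.0
    width = column_range.hi - column_range.lo
    if width <= 0:
        # Single-valued column that the filter includes: nothing can be pruned.
        return 1.0
    return (hi - lo) / width


def should_discard(filter_range: Range, column_range: Range,
                   threshold: float = 0.9) -> bool:
    """Discard the filter when it spans nearly the whole column range.

    The threshold value is an assumption made for this sketch.
    """
    return coverage(filter_range, column_range) >= threshold


# Column min/max for ss_sold_date_sk, as shown by "show column stats".
sold_date = Range(2450816, 2452642)

wide_filter = Range(2450816, 2452642)    # spans the entire column range
narrow_filter = Range(2451000, 2451100)  # spans a small slice of it

print(should_discard(wide_filter, sold_date))    # filter is not selective
print(should_discard(narrow_filter, sold_date))  # filter is worth keeping
```

In this sketch a filter that spans the whole column range cannot exclude any row, so evaluating it is wasted work, while the narrow filter overlaps only about 5% of the range and is kept.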