[ https://issues.apache.org/jira/browse/HIVE-28581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18017352#comment-18017352 ]
Vikram Ahuja commented on HIVE-28581:
-------------------------------------
[~dkuzmenko], thanks for working on this.
Is there a performance benchmark (TPC-DS, for instance) on some dataset that
shows how performance changes with this patch?
> Support Partition Pruning stats optimization for Iceberg tables
> ---------------------------------------------------------------
>
> Key: HIVE-28581
> URL: https://issues.apache.org/jira/browse/HIVE-28581
> Project: Hive
> Issue Type: Improvement
> Reporter: Denys Kuzmenko
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.1.0
>
>
> Add support for Iceberg partition prune stats optimization
> {code}
> create external table ice01 (`i` int, `t` timestamp)
> partitioned by (year int, month int, day int)
> stored by iceberg tblproperties ('format-version'='2',
> 'write.summary.partition-limit'='10');
> insert into ice01 (i, year, month, day) values
> (1, 2023, 10, 3),
> (2, 2023, 10, 3),
> (2, 2023, 10, 3),
> (3, 2023, 10, 4),
> (4, 2023, 10, 4);
> {code}
> explain
> select i from ice01 where year=2023 and month = 10 and day = 3;
> {code}
> POSTHOOK: type: QUERY
> POSTHOOK: Input: default@ice01
> POSTHOOK: Input: default@ice01@year=2023/month=10/day=3
> POSTHOOK: Output: hdfs://### HDFS PATH ###
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 depends on stages: Stage-1
> STAGE PLANS:
> Stage: Stage-1
> Tez
> #### A masked pattern was here ####
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: ice01
> filterExpr: ((year = 2023) and (month = 10) and (day = 3)) (type: boolean)
> Statistics: Num rows: 3 Data size: 48 Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: ((year = 2023) and (month = 10) and (day = 3)) (type: boolean)
> Statistics: Num rows: 3 Data size: 48 Basic stats: COMPLETE Column stats: NONE
> Select Operator
> expressions: i (type: int)
> outputColumnNames: _col0
> Statistics: Num rows: 3 Data size: 48 Basic stats: COMPLETE Column stats: NONE
> File Output Operator
> compressed: false
> Statistics: Num rows: 3 Data size: 48 Basic stats: COMPLETE Column stats: NONE
> table:
> input format: org.apache.hadoop.mapred.SequenceFileInputFormat
> output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Processor Tree:
> ListSink
> {code}