[
https://issues.apache.org/jira/browse/HIVE-29541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stamatis Zampetakis updated HIVE-29541:
---------------------------------------
Attachment: col_stats_part_ndv.q
> Imprecise NDV stats on Iceberg partition columns
> ------------------------------------------------
>
> Key: HIVE-29541
> URL: https://issues.apache.org/jira/browse/HIVE-29541
> Project: Hive
> Issue Type: Bug
> Components: Statistics
> Reporter: Stamatis Zampetakis
> Assignee: Stamatis Zampetakis
> Priority: Major
> Attachments: col_stats_part_ndv.q
>
>
> The number of distinct values (NDV/countDistinct) statistic is slightly off
> for Iceberg partition columns. Currently, the NDV stats for Iceberg
> (partition and regular) columns is computed by aggregating the individual
> stats from each partition. The aggregation logic is subject to a small margin
> of error since there is no way to have a fully accurate result from the
> moment that we rely on probabilistic data structures (i.e., HyperLogLog).
> However, for partition columns we know exactly how many partitions are
> present in the table so we don't need to rely on probabilistic data
> structures since the NDV is equal to the number of partitions (no complex
> aggregation needed). The StatsUtils class already contains some logic
> (getColStatsForPartCol) to compute the NDV along with some other stats
> directly from partitions but this does not kick in for Iceberg tables.
> The problem can be seen also in qtests after loading the LINEITEM table from
> TPC_0_001 database in an Iceberg table using the L_ORDERKEY as a partition
> key and running DESCRIBE FORMATTED on the partitioning column.
> {code:sql}
> DESC FORMATTED ice.lineitem l_orderkey
> {code}
> {noformat}
> col_name L_ORDERKEY
> data_type int
> min 1
> max 5988
> num_nulls 0
> distinct_count 1523
> avg_col_len
> max_col_len
> num_trues
> num_falses
> bit_vector HL
> comment
> COLUMN_STATS_ACCURATE
> {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"l_comment\":\"true\",\"l_commitdate\":\"true\",\"l_discount\":\"true\",\"l_extendedprice\":\"true\",\"l_linenumber\":\"true\",\"l_linestatus\":\"true\",\"l_orderkey\":\"true\",\"l_partkey\":\"true\",\"l_quantity\":\"true\",\"l_receiptdate\":\"true\",\"l_returnflag\":\"true\",\"l_shipdate\":\"true\",\"l_shipinstruct\":\"true\",\"l_shipmode\":\"true\",\"l_suppkey\":\"true\",\"l_tax\":\"true\"}}
> {noformat}
> Observe that distinct_count (NDV) is 1523 while the real number is 1500. In
> non-Iceberg tables the NDV is accurate.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)