[ 
https://issues.apache.org/jira/browse/HIVE-29541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stamatis Zampetakis updated HIVE-29541:
---------------------------------------
    Attachment: col_stats_part_ndv.q

> Imprecise NDV stats on Iceberg partition columns
> ------------------------------------------------
>
>                 Key: HIVE-29541
>                 URL: https://issues.apache.org/jira/browse/HIVE-29541
>             Project: Hive
>          Issue Type: Bug
>          Components: Statistics
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>         Attachments: col_stats_part_ndv.q
>
>
> The number of distinct values (NDV/countDistinct) statistic is slightly off 
> for Iceberg partition columns. Currently, the NDV stats for Iceberg 
> (partition and regular) columns are computed by aggregating the individual 
> stats from each partition. The aggregation is subject to a small margin of 
> error because it relies on probabilistic data structures (i.e., 
> HyperLogLog), which cannot produce a fully accurate result.
> However, for partition columns we know exactly how many partitions are 
> present in the table, so there is no need for probabilistic data structures: 
> the NDV is equal to the number of partitions and no complex aggregation is 
> required. The StatsUtils class already contains some logic 
> (getColStatsForPartCol) to compute the NDV, along with some other stats, 
> directly from the partitions, but it does not kick in for Iceberg tables.
> The problem can also be seen in qtests after loading the LINEITEM table from 
> the TPC_0_001 database into an Iceberg table, using L_ORDERKEY as the 
> partition key, and running DESCRIBE FORMATTED on the partitioning column.
> {code:sql}
> DESC FORMATTED ice.lineitem l_orderkey
> {code}
> {noformat}
> col_name              L_ORDERKEY          
> data_type             int                 
> min                   1                   
> max                   5988                
> num_nulls             0                   
> distinct_count        1523                
> avg_col_len                               
> max_col_len                               
> num_trues                                 
> num_falses                                
> bit_vector            HL                  
> comment                                   
> COLUMN_STATS_ACCURATE 
> {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"l_comment\":\"true\",\"l_commitdate\":\"true\",\"l_discount\":\"true\",\"l_extendedprice\":\"true\",\"l_linenumber\":\"true\",\"l_linestatus\":\"true\",\"l_orderkey\":\"true\",\"l_partkey\":\"true\",\"l_quantity\":\"true\",\"l_receiptdate\":\"true\",\"l_returnflag\":\"true\",\"l_shipdate\":\"true\",\"l_shipinstruct\":\"true\",\"l_shipmode\":\"true\",\"l_suppkey\":\"true\",\"l_tax\":\"true\"}}
> {noformat}
> Observe that distinct_count (NDV) is 1523 while the real number is 1500, an 
> overestimate of 23 (about 1.5%). In non-Iceberg tables the NDV is accurate.
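> The real value can be cross-checked with a plain aggregate query. This is a 
> minimal sketch, assuming the table was loaded as described above; if the 
> data matches the figure quoted here, it should return 1500.
> {code:sql}
> -- Exact distinct count of the partition column, for comparison with the
> -- distinct_count value reported by DESC FORMATTED.
> SELECT COUNT(DISTINCT l_orderkey) FROM ice.lineitem;
> {code}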



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
