[ 
https://issues.apache.org/jira/browse/HIVE-29541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18070607#comment-18070607
 ] 

Stamatis Zampetakis commented on HIVE-29541:
--------------------------------------------

A full repro of the problem along that shows also the impact on the query plan 
can be found in [^col_stats_part_ndv.q]

While digging into the problem I noticed that 
{{org.apache.hadoop.hive.ql.metadata.Table#isPartitionKey}} along with some 
other methods in the same class do not reflect the partitioning information for 
Iceberg tables. Since these methods are used all over the ql (compiler) module 
I found it a good idea to modify them so that partition keys are also reflected 
correctly for Iceberg tables. Unfortunately, various other failures appeared 
along the way and as it seems that any change in Table API to accommodate for 
Iceberg partitions can break many other places that rely on some other 
assumptions. Adding Iceberg specific fixes outside the Table API could work but 
that would make the code rather brittle so I am reluctant to go into this 
direction.

Due to lack of time I am pausing the investigation for the moment so if someone 
else wants to pick up this ticket feel free to do so.

> Imprecise NDV stats on Iceberg partition columns
> ------------------------------------------------
>
>                 Key: HIVE-29541
>                 URL: https://issues.apache.org/jira/browse/HIVE-29541
>             Project: Hive
>          Issue Type: Bug
>          Components: Statistics
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>         Attachments: col_stats_part_ndv.q
>
>
> The number of distinct values (NDV/countDistinct) statistic is slightly off 
> for Iceberg partition columns.  Currently, the NDV stats for Iceberg 
> (partition and regular) columns is computed by aggregating the individual 
> stats from each partition. The aggregation logic is subject to a small margin 
> of error since there is no way to have a fully accurate result from the 
> moment that we rely on probabilistic data structures (i.e., HyperLogLog).
> However, for partition columns we know exactly how many partitions are 
> present in the table so we don't need to rely on probabilistic data 
> structures since the NDV is equal to the number of partitions (no complex 
> aggregation needed). The StatsUtils class already contains some logic 
> (getColStatsForPartCol) to compute the NDV along with some other stats 
> directly from partitions but this does not kick in for Iceberg tables.
> The problem can be seen also in qtests after loading the LINEITEM table from 
> TPC_0_001 database in an Iceberg table using the L_ORDERKEY as a partition 
> key and running DESCRIBE FORMATTED on the partitioning column.
> {code:sql}
> DESC FORMATTED ice.lineitem l_orderkey
> {code}
> {noformat}
> col_name              L_ORDERKEY          
> data_type             int                 
> min                   1                   
> max                   5988                
> num_nulls             0                   
> distinct_count        1523                
> avg_col_len                               
> max_col_len                               
> num_trues                                 
> num_falses                                
> bit_vector            HL                  
> comment                                   
> COLUMN_STATS_ACCURATE 
> {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"l_comment\":\"true\",\"l_commitdate\":\"true\",\"l_discount\":\"true\",\"l_extendedprice\":\"true\",\"l_linenumber\":\"true\",\"l_linestatus\":\"true\",\"l_orderkey\":\"true\",\"l_partkey\":\"true\",\"l_quantity\":\"true\",\"l_receiptdate\":\"true\",\"l_returnflag\":\"true\",\"l_shipdate\":\"true\",\"l_shipinstruct\":\"true\",\"l_shipmode\":\"true\",\"l_suppkey\":\"true\",\"l_tax\":\"true\"}}
> {noformat}
> Observe that distinct_count (NDV) is 1523 while the real number is 1500. In 
> non-Iceberg tables the NDV is accurate.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to