[ https://issues.apache.org/jira/browse/IMPALA-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alex Rodoni closed IMPALA-6632. ------------------------------- Resolution: Won't Fix > Document compatibility of table and column stats between Impala and Hive > ------------------------------------------------------------------------ > > Key: IMPALA-6632 > URL: https://issues.apache.org/jira/browse/IMPALA-6632 > Project: IMPALA > Issue Type: Improvement > Components: Docs > Reporter: Alexander Behm > Assignee: Alex Rodoni > Priority: Major > > The question of compatibility between the table and column stats between Hive > and Impala comes up quite often, so is worth documenting explicitly. > Quoting myself from a recent discussion thread to get the docs effort started: > Commonalities: > - Hive and Impala both store row counts at the table level and partition > level. Hive also computes and stores additional stats like file counts which > Impala does not need or use. > Differences: > - Impala computes and stores column-level stats like the number of distinct > values (NDV) only at the table level, and not at the partition level. > - Hive computes and stores column-level stats at the partition level. Impala > does not follow this approach because the per-partition NDVs cannot be > sensibly combined for queries that access multiple partitions. In short, the > column stats for partitioned tables are not compatible between Impala and > Hive (because imo Hive's approach does not make sense). > - Impala uses a more modern and tuned algorithm (HyperLogLog++) for > estimating the number of distinct values, so they tend to be more accurate > than Hive's. Your mileage may vary. > - For unpartitioned tables, the Hive and Impala column stats are compatible. > For partitioned tables, the table-level column stats that Impala writes in > the Metastore are stored just like for unpartitioned tables. These statistics > are "available" to Hive in the sense that the standard retrieval APIs will > work as expected. My understanding is that for partitioned tables, Hive does > not use the table-level column stats, but instead expects partition-level > column stats. As I've said before, these partition-level column stats do not > make any sense because it is not possible to sensibly combine them for > multiple partitions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)