Hi All, In HIVE-3959, I'm working actively on guaranteeing accuracy of physical stats. For context, the status quo in Hive is that both Table stats and Partition stats exist but are quite unreliable (even with hive.stats.reliable set to true). Either stats should be reliable or they should not exist. At Facebook, the approach we want to pursue is to guarantee accuracy for Partitions and Unpartitioned Tables and drop Partitioned Tables completely from the stats loop. The stats for the latter can always be computed on the fly from its individual Partitions if needed. In the patch I'm working on, I have moved the computation of physical stats (numFiles and totalSize) out of StatsTask and into the MetaStore entry points like "add partition", "create table", etc.
While technically it is possible to maintain Partitioned Table stats by adding update logic at partition-level operations in the MetaStore, that will introduce additional data fetching and inefficiencies in that layer. Moreover, it is not enough to just add this logic. To be able to truly guarantee accuracy, we will need to account for concurrency in adding/dropping of partitions to a table. We will have to lock the Table object for Partition-level operations. This will likely be unacceptable in most Hive installations. We feel that maintaining Table stats accurately for Partitioned tables is sufficiently onerous that the better option is to compute them on the fly when needed. I'm sending this email out so that if someone really needs stats parameters to be present for partitioned tables, they can speak up. Any comments on the jira are welcome. Thanks! Bhushan