Hi All,

In HIVE-3959, I'm working actively on guaranteeing accuracy of  physical stats. 
For context, the status quo in Hive is that both Table stats and Partition 
stats exist but are quite unreliable (even with hive.stats.reliable set to 
true). Either stats should be reliable or they should not exist. At Facebook, 
the approach we want to pursue is to guarantee accuracy for Partitions and 
Unpartitioned Tables and drop Partitioned Tables completely from the stats 
loop. The stats for the latter can always be computed  on the fly from its 
individual Partitions if needed. In the patch I'm working on, I have moved the 
computation of physical stats (numFiles and totalSize) out of StatsTask and 
into the MetaStore entry points like "add partition", "create table", etc.

While technically it is possible to maintain Partitioned Table stats by adding 
update logic at partition-level operations in the MetaStore, that will 
introduce additional data fetching and inefficiencies in that layer. Moreover, 
it is not enough to just add this logic. To be able to truly guarantee 
accuracy, we will need to account for concurrency in adding/dropping of 
partitions to a table.  We will have to lock the Table object for 
Partition-level operations. This will likely be unacceptable in most Hive 
installations. We feel that maintaining Table stats accurately for Partitioned 
tables is sufficiently onerous that the better option is to compute them on the 
fly when needed. I'm sending this email out so that if someone really needs 
stats parameters to be present for partitioned tables, they can speak up. Any 
comments on the jira are welcome. Thanks!

Bhushan

Reply via email to