Ashutosh Bapat created HIVE-21037:
-------------------------------------

             Summary: Replicate column statistics for Hive tables
                 Key: HIVE-21037
                 URL: https://issues.apache.org/jira/browse/HIVE-21037
             Project: Hive
          Issue Type: Improvement
          Components: HiveServer2
            Reporter: Ashutosh Bapat
            Assignee: Ashutosh Bapat


Statistics are important for query optimization, so keeping them up to date on
the replica matters for query performance. Statistics are collected by scanning
a table in full. When data is replicated, we can therefore either (a) recompute
the statistics by scanning the data on the replica, or (b) replicate the
statistics along with the data. We prefer the second approach for the following
reasons.
 # Scanning the data on the replica isn’t a good option, since it wastes CPU
cycles and adds load during replication, which can be significant.
 # Storage systems such as S3 may have no compute capability, so when
replicating from on-prem to cloud we cannot rely on the target to gather
statistics.
 # For ACID tables, statistics should be associated with a snapshot. Statistics
collection on the target would therefore have to be synchronized with the
write-id on the source, since the target doesn't generate write-ids of its own.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
