John Sherman created HIVE-28346:
-----------------------------------

             Summary: Make ALTER CHANGE COLUMN more efficient with many partitions
                 Key: HIVE-28346
                 URL: https://issues.apache.org/jira/browse/HIVE-28346
             Project: Hive
          Issue Type: Improvement
          Components: HiveServer2, Metastore
            Reporter: John Sherman


By default, when a column is renamed, its column statistics are renamed and 
retained as well, via updateOrGetPartitionColumnStats().

However, for a partitioned table the stats are updated one partition at a time 
rather than via a bulk operation:
[https://github.com/apache/hive/blob/1c9969a003b09abc851ae7e19631ad208d3b6066/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java#L452]
So a table with N partitions ends up making at least N HMS calls (one per 
partition) for a CHANGE COLUMN. This can take minutes or even hours for large 
partitioned tables, to the point of hitting various timeouts.

Ideally, it should be able to update all the partitions at once, via a single 
HMS call or a single direct SQL update.
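
To make the difference in call counts concrete, here is a minimal sketch. FakeMetastore and its two methods are hypothetical stand-ins that only count round trips. They are not Hive APIs, and the real bulk path might look quite different.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: FakeMetastore and its methods are hypothetical, not Hive APIs.
class BulkRenameSketch {

    /** Hypothetical stand-in for the metastore that only counts round trips. */
    static class FakeMetastore {
        int calls = 0;

        // Models the current behavior: one call per partition.
        void renameColumnStats(String partition, String oldCol, String newCol) {
            calls++;
        }

        // Models the proposed behavior: one call covering every partition.
        void renameColumnStatsBulk(List<String> partitions, String oldCol, String newCol) {
            calls++;
        }
    }

    /** Renames stats partition by partition and returns the number of calls made. */
    static int renamePerPartition(FakeMetastore hms, List<String> partitions,
                                  String oldCol, String newCol) {
        for (String partition : partitions) {
            hms.renameColumnStats(partition, oldCol, newCol);
        }
        return hms.calls;
    }

    /** Renames stats for all partitions in one call and returns the number of calls made. */
    static int renameBulk(FakeMetastore hms, List<String> partitions,
                          String oldCol, String newCol) {
        hms.renameColumnStatsBulk(partitions, oldCol, newCol);
        return hms.calls;
    }

    public static void main(String[] args) {
        List<String> partitions =
                Arrays.asList("ds=2024-01-01", "ds=2024-01-02", "ds=2024-01-03");
        System.out.println("per-partition calls: "
                + renamePerPartition(new FakeMetastore(), partitions, "c_old", "c_new"));
        System.out.println("bulk calls: "
                + renameBulk(new FakeMetastore(), partitions, "c_old", "c_new"));
    }
}
```

The per-partition path grows linearly with the partition count, while the bulk path stays at one call regardless of N.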

We do have a workaround for this:
{code:java}
COLSTATS_RETAIN_ON_COLUMN_REMOVAL("metastore.colstats.retain.on.column.removal",
        "hive.metastore.colstats.retain.on.column.removal", true,
        "Whether to retain column statistics during column removals in partitioned tables - "
            + "disabling this purges all column statistics data for all partition to retain "
            + "working consistency"),
{code}

However, this has some downsides:
1) It retains stats by default, so the workaround has to be turned on explicitly
2) It affects all tables if enabled
3) It drops ALL column stats, not just those of the column being renamed.
4) It is not clear to users that this configuration will solve their issue 
(which typically presents as an ALTER CHANGE COLUMN operation timing out or 
taking a very long time).

Ideally we could add an API for bulk updates to partition objects that is much 
more efficient. Another approach could be to add a threshold configuration: if 
the number of partitions exceeds the configured value, ALTER would drop the 
column stats, and otherwise it would retain them.
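
A rough sketch of the threshold idea follows. The class, enum, and method names are made up for illustration, and no such configuration exists in Hive today. It only shows the decision, not how stats would actually be dropped or renamed.

```java
// Illustrative only: these names are hypothetical and do not exist in Hive.
class ColStatsThresholdSketch {

    enum Action { RETAIN_STATS, DROP_STATS }

    /**
     * Decides what ALTER CHANGE COLUMN would do with column stats.
     * Above the threshold the stats are dropped to avoid per-partition updates;
     * at or below it they are retained and renamed as today.
     */
    static Action decide(long partitionCount, long threshold) {
        return partitionCount > threshold ? Action.DROP_STATS : Action.RETAIN_STATS;
    }

    public static void main(String[] args) {
        long threshold = 1000; // assumed value, would come from configuration
        System.out.println("50 partitions: " + decide(50, threshold));
        System.out.println("250000 partitions: " + decide(250_000, threshold));
    }
}
```

Whether the boundary case (partition count equal to the threshold) should retain or drop is a design choice. The sketch retains, matching the "> threshold drops" wording above.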




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
