John Sherman created HIVE-28346:
-----------------------------------

             Summary: Make ALTER CHANGE COLUMN more efficient with many partitions
                 Key: HIVE-28346
                 URL: https://issues.apache.org/jira/browse/HIVE-28346
             Project: Hive
          Issue Type: Improvement
          Components: HiveServer2, Metastore
            Reporter: John Sherman

Currently, by default, when a column is renamed its column stats are renamed and maintained too, via updateOrGetPartitionColumnStats(). However, in the case of a partitioned table the stats are updated per partition rather than via a bulk operation:
[https://github.com/apache/hive/blob/1c9969a003b09abc851ae7e19631ad208d3b6066/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java#L452]

So a table with N partitions will end up making at least N HMS calls (one per partition) for a CHANGE COLUMN. This can take many minutes or even hours for large partitioned tables, to the point of hitting various timeouts. Ideally, it should be able to make a single HMS call, or a single update via direct SQL, to update all the partitions at once.

We do have a workaround for this:
{code:java}
COLSTATS_RETAIN_ON_COLUMN_REMOVAL("metastore.colstats.retain.on.column.removal",
    "hive.metastore.colstats.retain.on.column.removal", true,
    "Whether to retain column statistics during column removals in partitioned tables - disabling this purges all column statistics data for all partition to retain working consistency"),
{code}
However, this has some downsides:
1) It is set to retain stats by default.
2) Disabling it affects all tables.
3) Disabling it drops ALL column stats, not just those of the column being renamed.
4) It is not clear to users that this configuration will solve their issue (which typically presents as an ALTER CHANGE COLUMN operation timing out or taking a very long time).

Ideally, we could make an API for bulk updates to partition objects that is much more efficient.

Another approach could be to add a threshold configuration: if the number of partitions is greater than some configured value, ALTER would drop the column stats, and at or under that value it would retain them.
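
The threshold approach could be sketched roughly as below. This is only an illustration of the decision rule, not existing Hive code: the class ColStatsRenamePolicy, the Action enum, and the maxPartitionsForRetain parameter are all hypothetical names, and a real change would read the threshold from a new metastore configuration and apply it in the alter-table path.

```java
/**
 * Hypothetical sketch of the proposed threshold rule.
 * None of these names exist in Hive; they are assumptions for illustration.
 */
public final class ColStatsRenamePolicy {

    /** What ALTER CHANGE COLUMN would do with column stats. */
    public enum Action { RETAIN_STATS, DROP_STATS }

    // Assumed to come from a new, not-yet-existing configuration value.
    private final int maxPartitionsForRetain;

    public ColStatsRenamePolicy(int maxPartitionsForRetain) {
        if (maxPartitionsForRetain < 0) {
            throw new IllegalArgumentException("threshold must be >= 0");
        }
        this.maxPartitionsForRetain = maxPartitionsForRetain;
    }

    /**
     * Retain stats for tables at or under the threshold;
     * drop them once the partition count exceeds it.
     */
    public Action decide(int partitionCount) {
        return partitionCount > maxPartitionsForRetain
                ? Action.DROP_STATS
                : Action.RETAIN_STATS;
    }
}
```

With a threshold of 1000, for example, a table with 1000 partitions would keep its stats and pay the per-partition cost, while a table with 1001 partitions would skip the per-partition stats updates by dropping them.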