[ 
https://issues.apache.org/jira/browse/HIVE-28346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihua Deng resolved HIVE-28346.
--------------------------------
    Fix Version/s: 4.2.0
       Resolution: Fixed

> Make ALTER CHANGE COLUMN more efficient with many partitions
> ------------------------------------------------------------
>
>                 Key: HIVE-28346
>                 URL: https://issues.apache.org/jira/browse/HIVE-28346
>             Project: Hive
>          Issue Type: Improvement
>          Components: HiveServer2, Metastore
>            Reporter: John Sherman
>            Assignee: Zhihua Deng
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.2.0
>
>
> Currently, by default, when a column is renamed its column stats are renamed
> and retained too, via updateOrGetPartitionColumnStats().
> However, for a partitioned table the stats are updated per partition rather
> than via a bulk operation:
> [https://github.com/apache/hive/blob/1c9969a003b09abc851ae7e19631ad208d3b6066/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java#L452]
> So a table with N partitions will end up making at least N HMS calls (one
> per partition) for a CHANGE COLUMN. This can take many minutes or hours for
> large partitioned tables, and can even hit various timeouts.
> Ideally, it should be able to make a single HMS call, or a single update via
> direct SQL, to update all the partitions at once.
> We do have a workaround for this:
> {code:java}
> COLSTATS_RETAIN_ON_COLUMN_REMOVAL("metastore.colstats.retain.on.column.removal",
>         "hive.metastore.colstats.retain.on.column.removal", true,
>         "Whether to retain column statistics during column removals in partitioned tables - "
>             + "disabling this purges all column statistics data for all partition to retain "
>             + "working consistency"),
> {code}
> However, this has some downsides:
> 1) It is set to retain stats by default
> 2) It affects all tables if enabled
> 3) It drops ALL column stats and not just the column being renamed.
> 4) It is not clear to users that this configuration will solve their issue
> (which typically presents as an ALTER CHANGE COLUMN operation timing out or
> taking a very long time).
> Ideally, we could make an API for bulk updates to partition objects that is
> much more efficient. Another approach could be to add a threshold
> configuration: if the number of partitions is greater than the configured
> value, ALTER would drop the column stats; otherwise it would retain them.
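
The threshold idea in the last sentence of the description could be sketched roughly as follows. This is only an illustration of the decision rule, not code from the Hive patch: the class ColumnStatsRenamePolicy, the Action enum, and the decide() method are hypothetical names, and the threshold is passed in as a plain parameter because the issue does not name a configuration key.

```java
/**
 * Hypothetical sketch of the threshold rule proposed in the issue.
 * None of these names exist in Hive; they are for illustration only.
 */
public class ColumnStatsRenamePolicy {

    /** What ALTER ... CHANGE COLUMN would do with the column statistics. */
    enum Action {
        RETAIN_AND_RENAME, // keep the stats and rename them with the column
        DROP_STATS         // drop the stats instead of updating each partition
    }

    /**
     * Retain stats when the partition count is at or below the threshold,
     * drop them when it is above (the issue says "is >" the configured value).
     */
    static Action decide(long partitionCount, long threshold) {
        return partitionCount > threshold ? Action.DROP_STATS : Action.RETAIN_AND_RENAME;
    }

    public static void main(String[] args) {
        long threshold = 1_000L; // assumed value, chosen only for this example
        for (long partitions : new long[] {10L, 1_000L, 50_000L}) {
            System.out.println(partitions + " partitions -> " + decide(partitions, threshold));
        }
    }
}
```

A rule like this would keep today's behaviour for small tables while avoiding the per-partition stats updates on very large ones, at the cost of losing statistics above the threshold.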



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
