[ 
https://issues.apache.org/jira/browse/IMPALA-11969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-11969:
------------------------------------
    Labels: catalog-2024  (was: )

> Move incremental statistics out of the partition params to reduce partition's 
> memory footprint
> ----------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-11969
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11969
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog, Frontend
>    Affects Versions: Impala 4.2.0
>            Reporter: Aman Sinha
>            Assignee: Sai Hemanth Gantasala
>            Priority: Major
>              Labels: catalog-2024
>
> Impala stores incremental stats ([1] and [2]) within the partition params of 
> each partition object. The total bytes consumed by the incremental stats is 
> estimated as:
> {noformat}
> 200 bytes per column * num_columns * num_partitions  bytes
> {noformat}
> This means that for wide tables the partition object can get bloated and for 
> a few thousand partitions, this adds up.  It also affects Hive because the 
> same partition objects are fetched by Hive and can lead to memory pressure 
> even though Hive does not need those incremental stats. 
> The intent of the partition parameters in a Partition object was primarily to 
> keep certain properties or small objects.  We should move incremental stats 
> out of the partition params and store them separately in HMS. There is a 
> PART_COL_STATS in HMS that could potentially be used (with some redesign of 
> the schema) or a separate table could be added. 
> Additional notes:
> * Only the latest version of incremental stats is stored. They are stored in 
> partition parameters using keys like "impala_intermediate_stats_num_chunks" 
> and "impala_intermediate_stats_chunk0", "impala_intermediate_stats_chunk1", 
> "impala_intermediate_stats_chunk2", etc. The chunks are used to cap each 
> value size to be < 4000.   [3]
> * Impala compresses the incr stats of all columns into a byte array for each 
> partition. It'd be nice if HMS can compress them as well to save space.
> [1] 
> https://github.com/apache/impala/blob/master/common/thrift/CatalogObjects.thrift#L261
> [2] 
> https://github.com/apache/impala/blob/master/common/thrift/CatalogObjects.thrift#L232
> [3] 
> https://github.com/apache/impala/blob/419aa2e30db326f02e9b4ec563ef7864e82df86e/fe/src/main/java/org/apache/impala/catalog/PartitionStatsUtil.java#L182



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to