[jira] [Updated] (IMPALA-11969) Move incremental statistics out of the partition params to reduce partition's memory footprint

Aman Sinha (Jira) Fri, 03 Mar 2023 10:44:08 -0800


     [ 
https://issues.apache.org/jira/browse/IMPALA-11969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Aman Sinha updated IMPALA-11969:
--------------------------------
    Description: 
Impala stores incremental stats ([1] and [2]) within the partition params of 
each partition object. The total bytes consumed by the incremental stats is 
estimated as:
{noformat}
200 bytes per column * num_columns * num_partitions  bytes
{noformat}
This means that for wide tables the partition object can get bloated and for a 
few thousand partitions, this adds up.  It also affects Hive because the same 
partition objects are fetched by Hive and can lead to memory pressure even 
though Hive does not need those incremental stats. 

The intent of the partition parameters in a Partition object was primarily to 
keep certain properties or small objects.  We should move incremental stats out 
of the partition params and store them separately in HMS. There is a 
PART_COL_STATS in HMS that could potentially be used (with some redesign of the 
schema) or a separate table could be added. 

Additional notes:
* Only the latest version of incremental stats is stored. They are stored in 
partition parameters using keys like "impala_intermediate_stats_num_chunks" and 
"impala_intermediate_stats_chunk0", "impala_intermediate_stats_chunk1", 
"impala_intermediate_stats_chunk2", etc. The chunks are used to cap each value 
size to be < 4000.   [3]
* Impala compresses the incr stats of all columns into a byte array for each 
partition. It'd be nice if HMS can compress them as well to save space.


[1] 
https://github.com/apache/impala/blob/master/common/thrift/CatalogObjects.thrift#L261
[2] 
https://github.com/apache/impala/blob/master/common/thrift/CatalogObjects.thrift#L232
[3] 
https://github.com/apache/impala/blob/419aa2e30db326f02e9b4ec563ef7864e82df86e/fe/src/main/java/org/apache/impala/catalog/PartitionStatsUtil.java#L182

  was:
Impala stores incremental stats ([1] and [2]) within the partition params of 
each partition object. The total bytes consumed by the incremental stats is 
estimated as:
{noformat}
200 bytes per column * num_columns * num_partitions  bytes
{noformat}
This means that for wide tables the partition object can get bloated and for a 
few thousand partitions, this adds up.  It also affects Hive because the same 
partition objects are fetched by Hive and can lead to memory pressure even 
though Hive does not need those incremental stats. 

The intent of the partition parameters in a Partition object was primarily to 
keep certain properties or small objects.  We should move incremental stats out 
of the partition params and store them separately in HMS. There is a 
PART_COL_STATS in HMS that could potentially be used (with some redesign of the 
schema) or a separate table could be added. 

Additional notes:
* Only the latest version of incremental stats is stored. They are stored in 
partition parameters using keys like "impala_intermediate_stats_num_chunks" and 
"impala_intermediate_stats_chunk0", "impala_intermediate_stats_chunk1", 
"impala_intermediate_stats_chunk2", etc. The chunks are used to cap each value 
size to be < 4000:
* Impala compresses the incr stats of all columns into a byte array for each 
partition. It'd be nice if HMS can compress them as well to save space.


[1] 
https://github.com/apache/impala/blob/master/common/thrift/CatalogObjects.thrift#L261
[2] 
https://github.com/apache/impala/blob/master/common/thrift/CatalogObjects.thrift#L232
[3] 
https://github.com/apache/impala/blob/419aa2e30db326f02e9b4ec563ef7864e82df86e/fe/[…]src/main/java/org/apache/impala/catalog/PartitionStatsUtil.java


> Move incremental statistics out of the partition params to reduce partition's 
> memory footprint
> ----------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-11969
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11969
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog, Frontend
>    Affects Versions: Impala 4.2.0
>            Reporter: Aman Sinha
>            Priority: Major
>
> Impala stores incremental stats ([1] and [2]) within the partition params of 
> each partition object. The total bytes consumed by the incremental stats is 
> estimated as:
> {noformat}
> 200 bytes per column * num_columns * num_partitions  bytes
> {noformat}
> This means that for wide tables the partition object can get bloated and for 
> a few thousand partitions, this adds up.  It also affects Hive because the 
> same partition objects are fetched by Hive and can lead to memory pressure 
> even though Hive does not need those incremental stats. 
> The intent of the partition parameters in a Partition object was primarily to 
> keep certain properties or small objects.  We should move incremental stats 
> out of the partition params and store them separately in HMS. There is a 
> PART_COL_STATS in HMS that could potentially be used (with some redesign of 
> the schema) or a separate table could be added. 
> Additional notes:
> * Only the latest version of incremental stats is stored. They are stored in 
> partition parameters using keys like "impala_intermediate_stats_num_chunks" 
> and "impala_intermediate_stats_chunk0", "impala_intermediate_stats_chunk1", 
> "impala_intermediate_stats_chunk2", etc. The chunks are used to cap each 
> value size to be < 4000.   [3]
> * Impala compresses the incr stats of all columns into a byte array for each 
> partition. It'd be nice if HMS can compress them as well to save space.
> [1] 
> https://github.com/apache/impala/blob/master/common/thrift/CatalogObjects.thrift#L261
> [2] 
> https://github.com/apache/impala/blob/master/common/thrift/CatalogObjects.thrift#L232
> [3] 
> https://github.com/apache/impala/blob/419aa2e30db326f02e9b4ec563ef7864e82df86e/fe/src/main/java/org/apache/impala/catalog/PartitionStatsUtil.java#L182



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

[jira] [Updated] (IMPALA-11969) Move incremental statistics out of the partition params to reduce partition's memory footprint

Reply via email to