[ 
https://issues.apache.org/jira/browse/IMPALA-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13177:
------------------------------------
    Labels: catalog-2024  (was: )

> Compress encodedFileDescriptors inside the same partition
> ---------------------------------------------------------
>
>                 Key: IMPALA-13177
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13177
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>              Labels: catalog-2024
>         Attachments: Selection_124.png
>
>
> File names under a table usually share some substrings, e.g. query id, job 
> id, task id, etc. We can compress them to save memory. Especially when a 
> table has many small files, the memory footprint of the metadata cache is 
> dominated by encodedFileDescriptors.
> An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
> on S3 takes 605MB. 80% of that is used by encodedFileDescriptors. Each 
> encodedFileDescriptor is a byte array that takes 160B. Code:
> [https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723]
> Files of that table are created by Spark jobs. An example file name: 
> part-00006-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
> Here are some file names inside the same partition:
> !Selection_124.png|width=410,height=172!
> By compressing the encodedFileDescriptors inside the same partition, we 
> should be able to save a significant amount of memory in this case. 
> Compressing all of them across the whole table might be even better, but it 
> would hurt performance when coordinators load specific partitions from 
> catalogd.
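
To illustrate the per-partition idea described above, here is a minimal sketch in Python. It is not Impala's implementation. The helper names `pack_partition` and `unpack_partition` are hypothetical, zlib is only a stand-in codec, and plain file-name bytes stand in for the real encodedFileDescriptor byte arrays. The generated names assume that files in one partition share the first UUID of the example file name and differ in the second.

```python
import struct
import uuid
import zlib


def pack_partition(descriptors):
    """Length-prefix each descriptor, concatenate them, and compress once."""
    raw = b"".join(struct.pack(">I", len(d)) + d for d in descriptors)
    return zlib.compress(raw, 6)


def unpack_partition(blob):
    """Inverse of pack_partition: decompress, then split on the length prefixes."""
    raw = zlib.decompress(blob)
    descriptors, pos = [], 0
    while pos < len(raw):
        (size,) = struct.unpack_from(">I", raw, pos)
        pos += 4
        descriptors.append(raw[pos:pos + size])
        pos += size
    return descriptors


# Hypothetical file names of one partition (assumption: shared first UUID,
# a different second UUID per file). These stand in for real descriptors.
job_id = "f7e5265d-5a63-4477-8954-ac6cbaef553b"
names = [
    f"part-{i:05d}-{job_id}-{uuid.uuid5(uuid.NAMESPACE_OID, str(i))}.c000".encode()
    for i in range(50)
]

blob = pack_partition(names)
raw_size = sum(len(n) for n in names)
print(f"raw: {raw_size} B, compressed together: {len(blob)} B")
```

Because each partition is packed into its own blob, a single partition can still be unpacked without touching the rest of the table. The cost is that reading any one descriptor means decompressing its whole partition.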



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
