[ https://issues.apache.org/jira/browse/IMPALA-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Quanlong Huang updated IMPALA-13177:
------------------------------------
    Labels: catalog-2024  (was: )

> Compress encodedFileDescriptors inside the same partition
> ---------------------------------------------------------
>
>                 Key: IMPALA-13177
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13177
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>              Labels: catalog-2024
>         Attachments: Selection_124.png
>
>
> File names under a table usually share some substrings, e.g. query id, job
> id, task id, etc. We can compress them to save memory. Especially when a
> table suffers from the small-files problem, the memory footprint of the
> metadata cache is dominated by encodedFileDescriptors.
> An experiment shows that an HdfsTable with 67708 partitions and 3167561 files
> on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each
> encodedFileDescriptor is a byte array that takes 160B. Code:
> [https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723]
> Files of that table are created by Spark jobs. An example file name:
> part-00006-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
> Here are some file names inside the same partition:
> !Selection_124.png|width=410,height=172!
> By compressing the encodedFileDescriptors inside the same partition, we
> should be able to save a significant amount of memory in this case.
> Compressing all of them inside the same table might be even better, but it
> would hurt performance when coordinators load specific partitions from
> catalogd.
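
To make the proposal concrete, here is a minimal sketch of per-partition compression. It is not Impala's implementation. The class and method names are hypothetical, and DEFLATE via java.util.zip is an assumed choice, since the issue does not name an algorithm. Plain file names stand in for the real encodedFileDescriptor byte arrays. All descriptors of one partition are length-prefixed and compressed as a single stream, so substrings shared between files (such as the job UUID in the Spark file names) are stored once.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper, not part of Impala. */
final class PartitionFdCompressionSketch {

  /** Length-prefixes each descriptor and deflates the whole batch as one stream. */
  static byte[] compress(List<byte[]> descriptors) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    // Closing the outermost stream finishes the DEFLATE stream.
    try (DataOutputStream out = new DataOutputStream(new DeflaterOutputStream(buf))) {
      out.writeInt(descriptors.size());
      for (byte[] fd : descriptors) {
        out.writeInt(fd.length);
        out.write(fd);
      }
    }
    return buf.toByteArray();
  }

  /** Inflates the blob and splits it back into the original byte arrays. */
  static List<byte[]> decompress(byte[] blob) throws IOException {
    try (DataInputStream in = new DataInputStream(
        new InflaterInputStream(new ByteArrayInputStream(blob)))) {
      int count = in.readInt();
      List<byte[]> result = new ArrayList<>(count);
      for (int i = 0; i < count; i++) {
        byte[] fd = new byte[in.readInt()];
        in.readFully(fd);
        result.add(fd);
      }
      return result;
    }
  }

  public static void main(String[] args) throws IOException {
    // Stand-in data: names modelled on the example file name in the issue.
    List<byte[]> fds = new ArrayList<>();
    int rawBytes = 0;
    for (int i = 0; i < 100; i++) {
      byte[] name = String.format(
          "part-%05d-f7e5265d-5a63-4477-8954-ac6cbaef553b"
              + "-face6153-588c-4b44-a277-2836396bc57a.c000", i)
          .getBytes(StandardCharsets.UTF_8);
      rawBytes += name.length;
      fds.add(name);
    }
    byte[] blob = compress(fds);
    System.out.println("raw bytes: " + rawBytes + ", compressed bytes: " + blob.length);
    System.out.println("round trip entries: " + decompress(blob).size());
  }
}
```

The trade-off in this sketch is that reading any single descriptor requires inflating the whole partition's blob. That is the same granularity concern the issue raises about compressing at table level, only at a smaller scale.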