[ https://issues.apache.org/jira/browse/IMPALA-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Quanlong Huang updated IMPALA-13177:
------------------------------------
    Description:
File names under a table usually share some substrings, e.g. query id, job id, task id, etc. We can compress them to save memory. Especially for tables with a small-files problem, the memory footprint of the metadata cache is dominated by encodedFileDescriptors.

An experiment shows that an HdfsTable with 67708 partitions and 3167561 files on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each encodedFileDescriptor is a byte array that takes 160B. Code:
[https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723]

Files of that table are created by Spark jobs. An example file name:
part-00006-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000

Here are some file names inside the same partition:
!Selection_124.png|width=410,height=172!

By compressing the encodedFileDescriptors inside the same partition, we should be able to save a significant amount of memory in this case. Compressing all of them inside the same table might be even better, but it impacts performance when the coordinator loads specific partitions from catalogd.

  was:
File names under a table usually share some substrings, e.g. query id, job id, task id, etc. We can compress them to save memory. Especially for tables with a small-files problem, the memory footprint of the metadata cache is dominated by encodedFileDescriptors.

An experiment shows that an HdfsTable with 67708 partitions and 3167561 files on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each encodedFileDescriptor is a byte array that takes 160B. Code:
https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723

Files of that table are created by Spark jobs.
An example file name:
part-00006-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000

Here are some file names inside the same partition:
!Selection_124.png!

By compressing the encodedFileDescriptors inside the same partition, we should be able to save a significant amount of memory in this case. Compressing all of them inside the same table might be even better, but it impacts performance when the coordinator loads specific partitions from catalogd.


> Compress encodedFileDescriptors inside the same partition
> ---------------------------------------------------------
>
>                 Key: IMPALA-13177
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13177
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>         Attachments: Selection_124.png
>
>
> File names under a table usually share some substrings, e.g. query id, job
> id, task id, etc. We can compress them to save memory. Especially for tables
> with a small-files problem, the memory footprint of the metadata cache is
> dominated by encodedFileDescriptors.
> An experiment shows that an HdfsTable with 67708 partitions and 3167561 files
> on S3 takes 605MB. 80% of it is spent in encodedFileDescriptors. Each
> encodedFileDescriptor is a byte array that takes 160B. Code:
> [https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723]
> Files of that table are created by Spark jobs. An example file name:
> part-00006-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
> Here are some file names inside the same partition:
> !Selection_124.png|width=410,height=172!
> By compressing the encodedFileDescriptors inside the same partition, we
> should be able to save a significant amount of memory in this case.
> Compressing all of them inside the same table might be even better, but it
> impacts performance when the coordinator loads specific partitions from
> catalogd.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
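
The per-partition idea in the description can be illustrated with a minimal sketch. It is not Impala code: the class name PartitionFileNameCodec and its methods are hypothetical, and it compresses only the file names, whereas the real encodedFileDescriptors are byte arrays that hold more than the name. It assumes file names are non-empty and never contain a newline, and it uses DEFLATE from java.util.zip as one possible codec, which the issue does not specify. For scale, 3167561 descriptors at 160B each come to 506,809,760 bytes (about 483 MiB), which matches the reported 80% of 605MB if that figure is in MiB.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/**
 * Hypothetical helper (not part of Impala): packs all file names of one
 * partition into a single DEFLATE stream so that shared substrings such as
 * job ids are stored once instead of once per file.
 */
final class PartitionFileNameCodec {
  private PartitionFileNameCodec() {}

  /** Joins the names with '\n' (assumed absent from file names) and deflates them together. */
  static byte[] compress(List<String> fileNames) {
    byte[] raw = String.join("\n", fileNames).getBytes(StandardCharsets.UTF_8);
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DeflaterOutputStream out = new DeflaterOutputStream(buf)) {
      out.write(raw);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buf.toByteArray();
  }

  /** Inflates the block and splits it back into the original list of names. */
  static List<String> decompress(byte[] compressed) {
    ByteArrayOutputStream raw = new ByteArrayOutputStream();
    try (InflaterInputStream in =
        new InflaterInputStream(new ByteArrayInputStream(compressed))) {
      byte[] chunk = new byte[4096];
      int n;
      while ((n = in.read(chunk)) != -1) {
        raw.write(chunk, 0, n);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    String joined = new String(raw.toByteArray(), StandardCharsets.UTF_8);
    if (joined.isEmpty()) {
      return new ArrayList<>();
    }
    // limit -1 keeps every field, so the name count is preserved exactly.
    return new ArrayList<>(Arrays.asList(joined.split("\n", -1)));
  }
}
```

The trade-off mentioned in the description shows up here as well: reading any single name requires inflating the whole block, so the block boundary (partition versus table) decides how much work a lookup of one partition's files has to do.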