[jira] [Updated] (IMPALA-13177) Compress encodedFileDescriptors inside the same partition

2024-06-25 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13177:

Description: 
File names under a table usually share substrings, e.g. query id, job id, 
task id, etc. We can compress them to save memory. This matters most for 
tables with many small files, where encodedFileDescriptors dominate the memory 
footprint of the metadata cache.

An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
on S3 takes 605MB, 80% of which is spent on encodedFileDescriptors. Each 
encodedFileDescriptor is a byte array that takes 160B. Code:
[https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723]
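The figures above can be cross-checked with a quick back-of-the-envelope 
calculation. The sketch below is only illustrative: the class and method names 
are made up for this note, and it assumes that 160B is the average size per 
descriptor and that "MB" means MiB (2^20 bytes).

```java
// Back-of-the-envelope check of the numbers quoted in the description.
// Assumptions (not stated in the issue): 160B is an average per descriptor,
// and "605MB" is measured in MiB.
final class DescriptorFootprint {
    static final long NUM_PARTITIONS = 67_708L;
    static final long NUM_FILES = 3_167_561L;
    static final long BYTES_PER_DESCRIPTOR = 160L;
    static final double TABLE_SIZE_MIB = 605.0;

    /** Total bytes held by all encoded file descriptors of the table. */
    static long descriptorBytes() {
        return NUM_FILES * BYTES_PER_DESCRIPTOR;
    }

    /** Fraction of the table footprint taken by the descriptors. */
    static double shareOfTable() {
        double descriptorMiB = descriptorBytes() / (1024.0 * 1024.0);
        return descriptorMiB / TABLE_SIZE_MIB;
    }

    /** Average number of files per partition. */
    static double filesPerPartition() {
        return (double) NUM_FILES / NUM_PARTITIONS;
    }

    public static void main(String[] args) {
        System.out.println("descriptor bytes:    " + descriptorBytes());
        System.out.println("share of table:      " + shareOfTable());
        System.out.println("files per partition: " + filesPerPartition());
    }
}
```

Under these assumptions the descriptors add up to roughly 483 MiB, which is 
consistent with the reported 80%, and the table averages about 47 files per 
partition.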

Files of that table are created by Spark jobs. Here are some file names inside 
the same partition:
{noformat}
part-0-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-1-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-2-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-3-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-4-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-5-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-6-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-7-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-8-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-9-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00010-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00011-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00012-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00013-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00014-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
part-00015-14015d2b-b534-4747-8c42-c83a7af0f006-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000
 {noformat}
By compressing the encodedFileDescriptors inside the same partition, we should 
be able to save a significant amount of memory in this case. Compressing all of 
them across the whole table might save even more, but it would hurt performance 
when coordinators load specific partitions from catalogd.

We can consider doing this only for partitions whose number of files exceeds a 
threshold (e.g. 10).
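
As a rough illustration of the idea, the sketch below compresses the file 
names of one partition together using java.util.zip.Deflater and checks the 
round trip. It is not the Impala implementation: the class, the methods and 
the MIN_FILES_TO_COMPRESS constant are hypothetical, and it works on plain 
file names as a stand-in for the real encodedFileDescriptors, which are byte 
arrays. It also assumes file names never contain a newline, which is used as 
the separator.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch: compress all file names of one partition as one blob.
final class PartitionNameCompression {
    // Hypothetical threshold, following the "e.g. 10" suggestion above.
    static final int MIN_FILES_TO_COMPRESS = 10;

    /** Compress only when the file count exceeds the threshold. */
    static boolean shouldCompress(int numFiles) {
        return numFiles > MIN_FILES_TO_COMPRESS;
    }

    /** Generates names modeled on the Spark output files listed above. */
    static List<String> sampleNames(int n) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            names.add("part-" + i + "-14015d2b-b534-4747-8c42-c83a7af0f006"
                + "-71fda97e-a41d-488f-aa15-6fd9112b6c5b.c000");
        }
        return names;
    }

    /** Joins names with '\n' (assumed never to occur in a file name). */
    static byte[] join(List<String> names) {
        return String.join("\n", names).getBytes(StandardCharsets.UTF_8);
    }

    /** Deflates the joined names so shared substrings are stored once. */
    static byte[] compress(List<String> names) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(join(names));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inflates a blob produced by compress() back into the name list. */
    static List<String> decompress(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) {
                    throw new IllegalArgumentException("truncated input");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt input", e);
        } finally {
            inflater.end();
        }
        String joined = new String(out.toByteArray(), StandardCharsets.UTF_8);
        if (joined.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(joined.split("\n"));
    }

    public static void main(String[] args) {
        List<String> names = sampleNames(16);
        byte[] packed = compress(names);
        System.out.println("raw bytes: " + join(names).length
            + ", compressed bytes: " + packed.length);
        System.out.println("round trip ok: " + decompress(packed).equals(names));
    }
}
```

One trade-off this sketch makes visible: reading a single name requires 
inflating the whole blob for that partition. That is a reason to keep the 
compression unit no larger than a partition.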


> Compress encodedFileDescriptors inside the same partition
> -
>
> Key: IMPALA-13177
> URL: https://issues.apache.org/jira/browse/IMPALA-13177
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
>  Labels: catalog-2024
> Attachments: Selection_124.png

[jira] [Updated] (IMPALA-13177) Compress encodedFileDescriptors inside the same partition

2024-06-24 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13177:

Description: 
File names under a table usually share substrings, e.g. query id, job id, 
task id, etc. We can compress them to save memory. This matters most for 
tables with many small files, where encodedFileDescriptors dominate the memory 
footprint of the metadata cache.

An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
on S3 takes 605MB, 80% of which is spent on encodedFileDescriptors. Each 
encodedFileDescriptor is a byte array that takes 160B. Code:
[https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723]

Files of that table are created by Spark jobs. An example file name: 
part-6-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
Here are some file names inside the same partition:
!Selection_124.png|width=410,height=172!

By compressing the encodedFileDescriptors inside the same partition, we should 
be able to save a significant amount of memory in this case. Compressing all of 
them across the whole table might save even more, but it would hurt performance 
when coordinators load specific partitions from catalogd.

We can consider doing this only for partitions whose number of files exceeds a 
threshold (e.g. 10).


> Compress encodedFileDescriptors inside the same partition
> -
>
> Key: IMPALA-13177
> URL: https://issues.apache.org/jira/browse/IMPALA-13177
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
>  Labels: catalog-2024
> Attachments: Selection_124.png



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org



[jira] [Updated] (IMPALA-13177) Compress encodedFileDescriptors inside the same partition

2024-06-24 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13177:

Labels: catalog-2024  (was: )

> Compress encodedFileDescriptors inside the same partition
> -
>
> Key: IMPALA-13177
> URL: https://issues.apache.org/jira/browse/IMPALA-13177
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
>  Labels: catalog-2024
> Attachments: Selection_124.png






[jira] [Updated] (IMPALA-13177) Compress encodedFileDescriptors inside the same partition

2024-06-24 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13177:

Description: 
File names under a table usually share substrings, e.g. query id, job id, 
task id, etc. We can compress them to save memory. This matters most for 
tables with many small files, where encodedFileDescriptors dominate the memory 
footprint of the metadata cache.

An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
on S3 takes 605MB, 80% of which is spent on encodedFileDescriptors. Each 
encodedFileDescriptor is a byte array that takes 160B. Code:
[https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723]

Files of that table are created by Spark jobs. An example file name: 
part-6-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
Here are some file names inside the same partition:
!Selection_124.png|width=410,height=172!

By compressing the encodedFileDescriptors inside the same partition, we should 
be able to save a significant amount of memory in this case. Compressing all of 
them across the whole table might save even more, but it would hurt performance 
when coordinators load specific partitions from catalogd.


> Compress encodedFileDescriptors inside the same partition
> -
>
> Key: IMPALA-13177
> URL: https://issues.apache.org/jira/browse/IMPALA-13177
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
> Attachments: Selection_124.png






[jira] [Updated] (IMPALA-13177) Compress encodedFileDescriptors inside the same partition

2024-06-24 Thread Quanlong Huang (Jira)


 [ 
https://issues.apache.org/jira/browse/IMPALA-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang updated IMPALA-13177:

Description: 
File names under a table usually share substrings, e.g. query id, job id, 
task id, etc. We can compress them to save memory. This matters most for 
tables with many small files, where encodedFileDescriptors dominate the memory 
footprint of the metadata cache.

An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
on S3 takes 605MB, 80% of which is spent on encodedFileDescriptors. Each 
encodedFileDescriptor is a byte array that takes 160B. Code:
https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723

Files of that table are created by Spark jobs. An example file name: 
part-6-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
Here are some file names inside the same partition:
 !Selection_124.png! 

By compressing the encodedFileDescriptors inside the same partition, we should 
be able to save a significant amount of memory in this case. Compressing all of 
them across the whole table might save even more, but it would hurt performance 
when coordinators load specific partitions from catalogd.

  was:
File names under a table usually share substrings, e.g. query id, job id, 
task id, etc. We can compress them to save memory. This matters most for 
tables with many small files, where encodedFileDescriptors dominate the memory 
footprint of the metadata cache.

An experiment shows that an HdfsTable with 67708 partitions and 3167561 files 
on S3 takes 605MB, 80% of which is spent on encodedFileDescriptors. Each 
encodedFileDescriptor is a byte array that takes 160B. Code:
https://github.com/apache/impala/blob/6632fd00e17867c9f8f40d6905feafa049368a98/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L723

Files of that table are created by Spark jobs. An example file name: 
part-6-f7e5265d-5a63-4477-8954-ac6cbaef553b-face6153-588c-4b44-a277-2836396bc57a.c000
Here are some file names inside the same partition:

By compressing the encodedFileDescriptors inside the same partition, we should 
be able to save a significant amount of memory in this case. Compressing all of 
them across the whole table might save even more, but it would hurt performance 
when coordinators load specific partitions from catalogd.


> Compress encodedFileDescriptors inside the same partition
> -
>
> Key: IMPALA-13177
> URL: https://issues.apache.org/jira/browse/IMPALA-13177
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Catalog
>Reporter: Quanlong Huang
>Assignee: Quanlong Huang
>Priority: Critical
> Attachments: Selection_124.png


