[ 
https://issues.apache.org/jira/browse/IMPALA-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-13254 started by Fu Lili.
----------------------------------------
> Optimizing incremental reload performance of Iceberg tables
> -----------------------------------------------------------
>
>                 Key: IMPALA-13254
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13254
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>    Affects Versions: Impala 4.4.0
>            Reporter: Fu Lili
>            Assignee: Fu Lili
>            Priority: Major
>
> When performing a {{REFRESH}} on an Iceberg table, if the number of changed 
> files exceeds the {{iceberg_reload_new_files_threshold}} configuration 
> (default is 100), a highly inefficient reload operation is triggered.
> The main problem lies in the 
> {{IcebergFileMetadataLoader.getFileStatuses}} function. During incremental 
> loading, the {{listWithLocations}} parameter is always set to {{false}}, 
> so {{fs.getFileStatus}} and {{fs.getFileBlockLocations}} are called 
> sequentially for each {{contentFile}} (if the filesystem 
> supports {{StorageIds}}).
> To optimize this logic, the following changes can be made:
>  # In the {{IcebergFileMetadataLoader.getFileStatuses}} function, always 
> trigger {{parallelListing}} to quickly retrieve {{nameToFileStatus}}, 
> avoiding the sequential fetching for each {{contentFile}}.
>  # Increase the default value of {{iceberg_reload_new_files_threshold}} to 
> 1000. When changes are fewer than {{iceberg_reload_new_files_threshold}}, 
> perform a single RPC for each changed file to get the {{FileDescriptor}}. 
> The average time for a single operation is 1 to 3 milliseconds, so 1000 
> operations would take approximately 1 to 3 seconds, which is within a 
> reasonable range.
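
Illustrative note (not part of the Jira issue): the trade-off described above can be sketched roughly as follows. This is a minimal Python model under stated assumptions, not Impala's Java implementation. The names estimate_per_file_seconds, load_file_statuses, get_status and list_all are hypothetical stand-ins; only the threshold idea, the name-to-status map, and the 1 to 3 ms per-operation figure come from the issue text. Treating a change count equal to the threshold as "per-file" is also an assumption.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical stand-in for a file status record (not Impala's FileDescriptor).
FileStatus = Dict[str, object]


def estimate_per_file_seconds(num_files: int,
                              ms_low: float = 1.0,
                              ms_high: float = 3.0) -> Tuple[float, float]:
    """Estimated time range if every file costs one sequential RPC.

    The 1-3 ms defaults are the per-operation figures quoted in the issue.
    """
    return (num_files * ms_low / 1000.0, num_files * ms_high / 1000.0)


def load_file_statuses(
    changed_paths: List[str],
    threshold: int,
    get_status: Callable[[str], FileStatus],
    list_all: Callable[[], Iterable[FileStatus]],
) -> Dict[str, FileStatus]:
    """Pick per-file lookups or one bulk listing based on the threshold.

    get_status models one RPC per file; list_all models a single bulk
    listing of the table location. Both are hypothetical callbacks.
    """
    if len(changed_paths) > threshold:
        # Many changes: list once, build a name-to-status map, then look
        # up each changed file locally instead of issuing one RPC per file.
        name_to_status = {str(s["path"]): s for s in list_all()}
        return {p: name_to_status[p]
                for p in changed_paths if p in name_to_status}
    # Few changes: one RPC per changed file is cheap enough.
    return {p: get_status(p) for p in changed_paths}
```

With the defaults, estimate_per_file_seconds(1000) gives a range of 1 to 3 seconds, matching the estimate in the issue. The bulk branch trades many small round trips for one larger listing, which is the effect the first proposed change is after.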



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
