[ https://issues.apache.org/jira/browse/IMPALA-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on IMPALA-13254 started by Fu Lili.
----------------------------------------

> Optimizing incremental reload performance of Iceberg tables
> -----------------------------------------------------------
>
>                 Key: IMPALA-13254
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13254
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>    Affects Versions: Impala 4.4.0
>            Reporter: Fu Lili
>            Assignee: Fu Lili
>            Priority: Major
>
> When performing a {{REFRESH}} on an Iceberg table, if the number of changed
> files exceeds the {{iceberg_reload_new_files_threshold}} configuration
> (default is 100), a highly inefficient reload operation is triggered.
> The main issue with this code lies in the
> {{IcebergFileMetadataLoader.getFileStatuses}} function. During incremental
> loading, the {{listWithLocations}} parameter is always set to {{false}},
> resulting in {{fs.getFileStatus}} and {{fs.getFileBlockLocations}} operations
> being performed on each {{contentFile}} sequentially (if the filesystem
> supports {{StorageIds}}).
> To optimize this logic, the following changes can be made:
> # In the {{IcebergFileMetadataLoader.getFileStatuses}} function, always
> trigger {{parallelListing}} to quickly retrieve {{nameToFileStatus}},
> avoiding the sequential fetching for each {{contentFile}}.
> # Increase the default value of {{iceberg_reload_new_files_threshold}} to
> 1000. When changes are fewer than {{iceberg_reload_new_files_threshold}},
> perform a single RPC for each changed file to get the {{FileDescriptor}}.
> The average time for a single operation is 1 to 3 milliseconds, so 1000
> operations would take approximately 1 to 3 seconds, which is within a
> reasonable range.
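
The two proposed changes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Impala's actual code: the class `ReloadStrategySketch`, the methods `resolve` and `estimateSequentialMillis`, and the constant `NEW_FILES_THRESHOLD` are hypothetical names, and the filesystem calls are replaced by plain function arguments so the example stays self-contained. The issue does not say what happens when the change count equals the threshold, so the sketch arbitrarily sends that case to the bulk listing.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the Impala code base.
class ReloadStrategySketch {
    // Stand-in for iceberg_reload_new_files_threshold, using the value
    // proposed in the issue (the current default is 100).
    static final int NEW_FILES_THRESHOLD = 1000;

    /**
     * Resolves a status object for every changed file path.
     *
     * Fewer changes than the threshold: one lookup per file (the
     * "single RPC for each changed file" case from the issue).
     * Otherwise: one bulk listing that yields a path-to-status map,
     * which is then consulted for each changed path.
     */
    static <S> Map<String, S> resolve(
            List<String> changedPaths,
            int threshold,
            Function<String, S> perFileLookup,
            Supplier<Map<String, S>> bulkListing) {
        Map<String, S> result = new HashMap<>();
        if (changedPaths.size() < threshold) {
            // Sequential per-file lookups; cost grows with the file count.
            for (String path : changedPaths) {
                result.put(path, perFileLookup.apply(path));
            }
            return result;
        }
        // A single listing pass replaces the per-file round trips.
        Map<String, S> nameToStatus = bulkListing.get();
        for (String path : changedPaths) {
            S status = nameToStatus.get(path);
            if (status != null) {
                result.put(path, status);
            }
        }
        return result;
    }

    /** Rough cost of sequential lookups: fileCount * perCallMillis. */
    static long estimateSequentialMillis(int fileCount, long perCallMillis) {
        return (long) fileCount * perCallMillis;
    }

    public static void main(String[] args) {
        // 1000 lookups at 1 ms and 3 ms each, matching the issue's estimate.
        System.out.println(estimateSequentialMillis(NEW_FILES_THRESHOLD, 1)); // 1000
        System.out.println(estimateSequentialMillis(NEW_FILES_THRESHOLD, 3)); // 3000
    }
}
```

The estimate in `main` mirrors the arithmetic in the description: at 1 to 3 ms per operation, 1000 sequential operations cost roughly 1000 to 3000 ms.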