abmo-x opened a new pull request, #7363: URL: https://github.com/apache/iceberg/pull/7363
AddFilesProcedure uses InMemoryFileIndex to get SparkPartitions. InMemoryFileIndex is used inefficiently here: it lists _all_ the files under a given path to discover partitions, which is memory intensive and unnecessary, since we only want to find the partitions. Partition discovery can be achieved without listing all the files.

This PR replaces InMemoryFileIndex with a custom implementation of _PartitioningAwareFileIndex_ that avoids listing _all_ files and lists only the directories to discover partitions.

This optimization reduces _listPartitions_ _latency_ by more than 90% and _memory_ usage by more than 3 times.

Addresses Issue: https://github.com/apache/iceberg/issues/7027

Tested with an input S3 folder containing a large number of files.

All Files in path: Filter on date and hour, where a single partition has 41,417 files

| Cluster Config | Before | After | Improvement |
| ------------- | ------------- | ------------- | ------------- |
| Driver Memory | 128GB (Fails with OOM if < 128GB) | 8GB | ~93% |
| Executor Memory | 64GB | 8GB | ~87% |
| Num of Executors | 20 | 4 | ~80% |
| List Partitions Latency | 12M49.459S | 0.774S | ~99% |

Latency to add the files in all partitions once they are listed is effectively the same: 2H25M14.905S before and 2H26M15.803S after.
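To illustrate the idea of discovering partitions from directories alone, here is a minimal sketch in plain Java against a local filesystem. It is not the code in this PR: the PR implements a custom _PartitioningAwareFileIndex_ inside Spark, whereas the class `PartitionDirLister` and its methods below are hypothetical names made up for this example. It assumes Hive-style `key=value` directory names and does no unescaping of partition values.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only (not part of Iceberg or Spark).
public class PartitionDirLister {

  // Returns one key->value map per leaf partition directory under root.
  // Data files are never collected; only directories are followed.
  public static List<Map<String, String>> listPartitions(Path root) throws IOException {
    List<Map<String, String>> partitions = new ArrayList<>();
    collect(root, new LinkedHashMap<>(), partitions);
    return partitions;
  }

  private static void collect(
      Path dir, Map<String, String> spec, List<Map<String, String>> out) throws IOException {
    boolean hasChildPartition = false;

    // The filter keeps only sub-directories, so files are skipped immediately.
    try (DirectoryStream<Path> children = Files.newDirectoryStream(dir, Files::isDirectory)) {
      for (Path child : children) {
        String name = child.getFileName().toString();
        int eq = name.indexOf('=');
        if (eq <= 0) {
          continue; // not a key=value directory (e.g. _temporary), ignore it
        }
        hasChildPartition = true;
        Map<String, String> childSpec = new LinkedHashMap<>(spec);
        childSpec.put(name.substring(0, eq), name.substring(eq + 1));
        collect(child, childSpec, out);
      }
    }

    // A key=value directory with no key=value children is a leaf partition.
    if (!hasChildPartition && !spec.isEmpty()) {
      out.add(spec);
    }
  }
}
```

Calling `PartitionDirLister.listPartitions(root)` on a layout such as `root/date=2023-04-01/hour=00/` yields one map per `date`/`hour` combination, regardless of how many data files each leaf directory holds. On an object store such as S3, the equivalent would be a listing that returns only common prefixes rather than every object; that part is outside this sketch.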
