abmo-x opened a new pull request, #7363:
URL: https://github.com/apache/iceberg/pull/7363

   AddFilesProcedure uses InMemoryFileIndex to get SparkPartitions. 
InMemoryFileIndex is inefficient for this purpose: it lists _all_ the files in a 
given path to discover partitions, which is memory intensive and unnecessary 
because we only want to find the partitions. That can be achieved without 
listing all the files.
   
   This PR replaces InMemoryFileIndex with a custom implementation of 
_PartitioningAwareFileIndex_ that avoids listing _all_ files and lists only 
the directories to discover partitions. 
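
   To illustrate the idea of discovering partitions from directories alone, here is a minimal Python sketch. It is not the code in this PR (which is a Spark _PartitioningAwareFileIndex_ implementation); the function name `discover_partitions`, the Hive-style `key=value` directory layout, and the assumption that the partition column names are known up front are all hypothetical. Because the walk stops at the partition depth, the leaf directories that hold the data files are never listed.

```python
import os
from typing import Dict, List


def discover_partitions(root: str, partition_columns: List[str]) -> List[Dict[str, str]]:
    """Return one {column: value} dict per leaf partition directory under root.

    Assumes a Hive-style layout such as root/date=2023-04-01/hour=00/...
    Only directories above the leaf level are listed, so the contents of
    the leaf partition directories (the data files) are never enumerated.
    Escaped characters in partition values are not decoded in this sketch.
    """
    results: List[Dict[str, str]] = []

    def descend(path: str, depth: int, values: Dict[str, str]) -> None:
        # All partition columns resolved: record the partition and stop
        # without listing the directory that contains the data files.
        if depth == len(partition_columns):
            results.append(dict(values))
            return

        expected = partition_columns[depth]
        with os.scandir(path) as entries:
            for entry in sorted(entries, key=lambda e: e.name):
                if not entry.is_dir():
                    continue  # ignore stray files such as _SUCCESS
                name, sep, value = entry.name.partition("=")
                if sep and name == expected:
                    values[expected] = value
                    descend(entry.path, depth + 1, values)
                    del values[expected]

    descend(root, 0, {})
    return results


if __name__ == "__main__":
    # Hypothetical local path; replace with a real partitioned directory.
    for partition in discover_partitions("/tmp/events", ["date", "hour"]):
        print(partition)
```

   With two partition columns, the sketch lists the root and each `date=` directory, but never an `hour=` directory, so the number of listing calls depends on the number of partitions rather than the number of files.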
   
   This optimization reduces _listPartitions_ latency by more than 90% and 
memory usage by more than a factor of 3. 
   
   Addresses Issue: https://github.com/apache/iceberg/issues/7027
   
   Tested with an input S3 folder containing a large number of files.
   
   All Files in path: 
   Filter on date and hour, where a single partition has 41,417 files
   
   | Cluster Config | Before | After | Improvement |
   | ------------- | ------------- | ------------- | ------------- |
   | Driver Memory | 128GB (Fails with OOM if < 128GB) | 8GB | ~93% |
   | Executor Memory | 64GB | 8GB | ~87% |
   | Num of Executors | 20 | 4 | ~80% |
   | List Partitions Latency | 12M49.459S | 0.774S | ~99% |
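
   The Improvement column can be reproduced from the Before and After values. The short Python sketch below shows the arithmetic; the helper names `parse_duration` and `reduction_percent` are illustrative and not part of the PR. The table's figures appear to be truncated rather than rounded (for example, the driver memory reduction works out to 93.75%).

```python
import re


def parse_duration(text: str) -> float:
    """Convert a duration such as '12M49.459S' or '2H25M14.905S' to seconds."""
    match = re.fullmatch(r"(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?", text)
    if not text or match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)


def reduction_percent(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


if __name__ == "__main__":
    rows = {
        "Driver Memory (GB)": (128, 8),
        "Executor Memory (GB)": (64, 8),
        "Num of Executors": (20, 4),
        "List Partitions Latency (s)": (
            parse_duration("12M49.459S"),
            parse_duration("0.774S"),
        ),
    }
    for name, (before, after) in rows.items():
        print(f"{name}: {reduction_percent(before, after):.2f}% reduction")
```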
   
   The latency to add the files in all partitions once they are listed is 
essentially unchanged: 2H25M14.905S before and 2H26M15.803S after.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

