Re: [I] removing orphan files, when using S3, could potentially be done via S3 Lifecycles? [iceberg]

via GitHub Thu, 07 Aug 2025 07:08:48 -0700


RussellSpitzer commented on issue #13693:
URL: https://github.com/apache/iceberg/issues/13693#issuecomment-3164360223


   I don't think the "delete" portion is generally that slow, since with the 
bulk APIs we now have, a single thread can usually issue all the deletes 
relatively quickly. I do know some implementers do as @jkolash has described 
and change the consumer of the method so that it puts the paths to be deleted 
in a queue and has another service do the actual cleanup.
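
   As a rough sketch of that queue pattern (not Iceberg code): the delete callback only enqueues paths, and a separate worker drains the queue and issues bulk deletes. In Iceberg the callback would be whatever you pass to `deleteWith` on the action; here `make_enqueuer`, `drain`, and the `bulk_delete` argument are hypothetical names for illustration, and the "other service" is simulated by a thread. The batch size default of 1000 reflects the per-request key limit of S3's DeleteObjects API.

```python
import queue
import threading

_STOP = object()  # sentinel telling the worker that no more paths are coming


def make_enqueuer(q):
    """Return a callback that only records the path instead of deleting it."""
    def enqueue(path):
        q.put(path)
    return enqueue


def drain(q, bulk_delete, batch_size=1000):
    """Worker loop: group queued paths into batches and hand each to bulk_delete."""
    batch = []
    while True:
        item = q.get()
        if item is _STOP:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            bulk_delete(batch)
            batch = []
    if batch:  # flush the final partial batch
        bulk_delete(batch)


# Demo: bulk_delete is a stand-in that just records the batches it was given.
deleted_batches = []
q = queue.Queue()
worker = threading.Thread(target=drain, args=(q, deleted_batches.append, 2))
worker.start()

enqueue = make_enqueuer(q)
for p in ["s3://bucket/t/a.parquet", "s3://bucket/t/b.parquet", "s3://bucket/t/c.parquet"]:
    enqueue(p)  # this is what the orphan-file job would call per path

q.put(_STOP)
worker.join()
```

   In a real deployment the in-process queue would presumably be replaced by a durable one, so the job can finish as soon as the paths are handed off.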
   
   The most expensive part of this job tends to be the actual file listing. 
That's why we have the option of feeding in the list of "existing" files as a 
DataFrame in the Spark option. This lets a user turn on S3 Inventory or a 
similar service, and the implementation will use that instead of actually 
listing S3. I would definitely try that first, or benchmark, before trying 
something with tags. 
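
   To illustrate the idea without Spark, here is a minimal sketch that turns inventory rows into a list of "existing" files and compares it against the set of files the table still references. Everything in it is an assumption for illustration: the four-column CSV layout (bucket, key, size, last-modified date), the helper names `inventory_to_file_list` and `find_orphans`, and the sample paths. In Iceberg itself I believe the equivalent input is a DataFrame with `file_path` and `last_modified` columns passed to the Spark action (`compareToFileList`) or to the procedure's `file_list_view` argument, but check the docs for your version.

```python
import csv
import io
from datetime import datetime, timezone
from urllib.parse import unquote


def inventory_to_file_list(inventory_csv, scheme="s3"):
    """Parse inventory CSV text into (file_path, last_modified) records.

    Assumed column order: bucket, key, size, last_modified_date.
    Keys in CSV inventory reports are URL-encoded, so decode them.
    """
    records = []
    for bucket, key, _size, last_modified in csv.reader(io.StringIO(inventory_csv)):
        path = f"{scheme}://{bucket}/{unquote(key)}"
        ts = datetime.strptime(last_modified, "%Y-%m-%dT%H:%M:%S.%fZ")
        records.append((path, ts.replace(tzinfo=timezone.utc)))
    return records


def find_orphans(file_list, referenced_paths, older_than):
    """Files that exist, are not referenced by the table, and are old enough."""
    referenced = set(referenced_paths)
    return sorted(
        path for path, ts in file_list
        if ts < older_than and path not in referenced
    )


SAMPLE_INVENTORY = (
    '"my-bucket","warehouse/db/tbl/data/a.parquet","1024","2025-07-01T00:00:00.000Z"\n'
    '"my-bucket","warehouse/db/tbl/data/dt%3D2025-07-01/b.parquet","2048","2025-07-01T00:00:00.000Z"\n'
    '"my-bucket","warehouse/db/tbl/data/c.parquet","4096","2025-08-07T00:00:00.000Z"\n'
)

file_list = inventory_to_file_list(SAMPLE_INVENTORY)
orphans = find_orphans(
    file_list,
    referenced_paths=["s3://my-bucket/warehouse/db/tbl/data/a.parquet"],
    older_than=datetime(2025, 8, 4, tzinfo=timezone.utc),
)
print(orphans)
```

   The `older_than` cutoff matters here for the same reason it does in the real job: a recently written file may belong to a commit that is still in flight, so `c.parquet` is skipped even though nothing references it yet.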


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: issues-h...@iceberg.apache.org
