RussellSpitzer commented on issue #13693: URL: https://github.com/apache/iceberg/issues/13693#issuecomment-3164360223
I don't think the "delete" portion is generally that slow: with the bulk APIs we now have, a single thread can usually issue all the deletes relatively quickly. I do know some implementers do as @jkolash has described and change the consumer of the method to instead put the paths to be deleted in a queue and have another service do the actual cleanup. The most expensive part of this job tends to be the actual file listing. That's why we have the option of feeding in the list of "existing" files as a DataFrame in the Spark option. This lets a user turn on S3 Inventory or a similar service, and the implementation will use that instead of actually listing S3. I would definitely try that first, or benchmark, before trying something with tags.
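To make the idea concrete, here is a minimal sketch of orphan detection driven by a pre-computed file list instead of a live listing. It is plain Python and not Iceberg's implementation. The function name `find_orphans`, the `(file_path, last_modified)` row shape, the example paths, and the `cleanup_queue` are all assumptions made for illustration. The last step mimics the "put paths in a queue" pattern mentioned above rather than deleting inline.

```python
from datetime import datetime, timedelta, timezone
from queue import Queue


def find_orphans(inventory, referenced, older_than):
    """Return paths that appear in the inventory but not in table metadata.

    inventory:  iterable of (file_path, last_modified) pairs, for example
                rows taken from a storage inventory report (hypothetical shape)
    referenced: set of file paths reachable from table metadata
    older_than: only files last modified before this time are candidates,
                so files from in-flight writes are left alone
    """
    return sorted(
        path
        for path, last_modified in inventory
        if last_modified < older_than and path not in referenced
    )


# Hypothetical data: three files in the inventory, one referenced by the table.
now = datetime(2025, 8, 7, tzinfo=timezone.utc)
inventory = [
    ("s3://bucket/tbl/data/a.parquet", now - timedelta(days=10)),
    ("s3://bucket/tbl/data/b.parquet", now - timedelta(days=10)),
    ("s3://bucket/tbl/data/c.parquet", now - timedelta(hours=1)),
]
referenced = {"s3://bucket/tbl/data/a.parquet"}

orphans = find_orphans(inventory, referenced, older_than=now - timedelta(days=3))

# Instead of deleting inline, hand the paths to a queue that a separate
# cleanup worker (not shown) would drain.
cleanup_queue = Queue()
for path in orphans:
    cleanup_queue.put(path)

print(orphans)  # ['s3://bucket/tbl/data/b.parquet']
```

Here `b.parquet` is the only orphan: `a.parquet` is still referenced, and `c.parquet` is too recent to be considered. No listing call is made at all, which is the point of supplying the file list up front.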