Ryan, Thanks for the helpful explanation of some of the more complicated scenarios. I think with our relatively straightforward use cases we’ll be able to stick with the ExpireSnapshots API.
Casey

Casey Lucas
Director, Engineering
dynata.com <http://www.dynata.com>

The information contained in this e-mail message is intended for the use of the recipient(s) named above and is privileged and confidential. If you are not the intended recipient, you are formally notified that you have received this message in error and that any review, dissemination, distribution, or copying of the message is strictly prohibited. If you have received this communication in error, please notify us immediately by e-mail and delete the original message.

From: Ryan Blue <[email protected]>
Date: Monday, June 28, 2021 at 12:03 PM
To: [email protected] <[email protected]>
Subject: Re: [EXT] Re: question about the gc in iceberg

Casey, no problem. I probably should have explained more to begin with.

The ExpireSnapshots API operation works by looking at the changes in a snapshot when that snapshot ages off. So any files deleted by that snapshot can be deleted from the file system because they are no longer referenced by the table. That works great in most cases and it is what we used in production at Netflix for years. But, there are some newer cases where that doesn't work so well.

One case where that doesn't work very well is when you have more complex changes to table versions/snapshots than a linear history. For example, you might stage commits, validate the data, and then cherry-pick the commit to be the current version. So there are commits that may not be in the current table state. That can also happen when you roll back the table and then commit again. In those cases, we don't delete the files deleted when the snapshot expires because those changes didn't happen in the current table state.
Instead, we delete the files added by the commit because those aren't in the current state. You can see that this logic gets complicated quickly. And the cases where we cherry-pick a commit to master make it even more complex because you have multiple copies (that reference one another).

The complexity of incremental expiration made us go a different direction with the action. Instead of trying to figure out from history whether deletes can happen to physical files or not, we changed to using the set of all data and metadata files referenced by the table, computed before and after expiration. We diff the two sets and remove anything that is no longer referenced. That's simpler and handles cases we never would be able to before, like when someone mistakenly deletes a file and then commits the same file again through the API.

The drawback is that this is a lot more work and requires Spark. So we keep both implementations, but we recommend using the action-based one.

Ryan

On Sat, Jun 26, 2021 at 6:26 AM Casey Lucas <[email protected]> wrote:

Hi Ryan,

We’re using Iceberg but not with Spark. Can you provide any specifics on why you say the Java call made outside of Spark “isn’t as good as the action-based one”?
Thanks,
Casey
From: Ryan Blue <[email protected]>
Date: Wednesday, June 23, 2021 at 11:48 AM
To: [email protected] <[email protected]>
Subject: [EXT] Re: question about the gc in iceberg

There is also a way to expire snapshots without using Spark, through the ExpireSnapshots API:

    table.expireSnapshots().expireOlderThan(timestampInMs).commit();

That is what we used in production for a long time, but it isn’t as good as the action-based one that compares file trees. I’d recommend using the expire_snapshots procedure that Russell pointed to: https://iceberg.apache.org/spark-procedures/#expire_snapshots

On Wed, Jun 23, 2021 at 7:49 AM Russell Spitzer <[email protected]> wrote:

There are "actions" which contain common table maintenance tasks. You are most likely interested in ExpireSnapshots, RewriteDataFiles and RemoveOrphanFiles; see https://iceberg.apache.org/spark-procedures/

On Tue, Jun 22, 2021 at 7:19 PM yong.sunny <[email protected]> wrote:

Hi Iceberg Dev,

Is there any existing mechanism to do GC in Iceberg? Or is there an implementation based on Spark?

Thanks and Best regards,
Yong

--
Ryan Blue
Tabular
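
For readers of this thread, the set-diff idea Ryan describes (collect every file the table references before and after expiration, then remove whatever is no longer referenced) can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Iceberg's implementation: a snapshot is modeled as a plain set of file paths, and the class name ExpireSketch, the methods reachable and filesToDelete, and the file names are all hypothetical. It uses only java.util and does not touch the Iceberg API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: a snapshot is just the set of file paths it references.
// This is NOT Iceberg's implementation; every name here is hypothetical.
class ExpireSketch {

    // Union of every file referenced by any of the given snapshots.
    static Set<String> reachable(List<Set<String>> snapshots) {
        Set<String> files = new HashSet<>();
        for (Set<String> snapshot : snapshots) {
            files.addAll(snapshot);
        }
        return files;
    }

    // Files referenced before expiration but by no snapshot retained afterwards.
    static Set<String> filesToDelete(List<Set<String>> before, List<Set<String>> after) {
        Set<String> unreferenced = new HashSet<>(reachable(before));
        unreferenced.removeAll(reachable(after));
        return unreferenced;
    }

    public static void main(String[] args) {
        // s1 references a.parquet and b.parquet; s2 mistakenly drops b.parquet;
        // s3 commits the same b.parquet again.
        Set<String> s1 = Set.of("a.parquet", "b.parquet");
        Set<String> s2 = Set.of("a.parquet");
        Set<String> s3 = Set.of("a.parquet", "b.parquet");

        // Expire s1 and s2, retain s3. b.parquet is still referenced by s3,
        // so the diff is empty and nothing is removed.
        Set<String> toDelete = filesToDelete(List.of(s1, s2, s3), List.of(s3));
        System.out.println(toDelete); // prints []
    }
}
```

In this model the file that was deleted and then re-committed survives because the decision depends only on what the retained snapshots still reference, not on the history of individual commits. The cost is the one mentioned in the thread: the full set of referenced files has to be computed twice, which for a real table is a lot more work than inspecting only the expiring snapshot.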
