Ryan,
Thanks for the helpful explanation of some of the more complicated scenarios.  
I think with our relatively straightforward use cases we’ll be able to stick 
with the ExpireSnapshots API.

Casey





[Dynata]<http://www.dynata.com/>


Casey Lucas
Director, Engineering




dynata.com<http://www.dynata.com>


The information contained in this e-mail message is intended for the use of the 
recipient(s) named above and is privileged and confidential. If you are not the 
intended recipient, you are formally notified that you have received this 
message in error and that any review, dissemination, distribution, or copying 
of the message is strictly prohibited. If you have received this communication 
in error, please notify us immediately by e-mail and delete the original 
message.

From: Ryan Blue <[email protected]>
Date: Monday, June 28, 2021 at 12:03 PM
To: [email protected] <[email protected]>
Subject: Re: [EXT] Re: question about the gc in iceberg

Casey, no problem. I probably should have explained more to begin with.

The ExpireSnapshots API operation works by looking at the changes in a snapshot 
when that snapshot ages off. So any files deleted by that snapshot can be 
deleted from the file system because they are no longer referenced by the 
table. That works great in most cases and it is what we used in production at 
Netflix for years. But, there are some newer cases where that doesn't work so 
well.
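
To make the incremental approach concrete, here is a minimal sketch in plain Java. It is a toy model, not Iceberg's implementation: the `Snapshot` class, the `filesToRemove` helper, and the file names are all hypothetical, and it assumes a strictly linear history in which the caller never expires the current snapshot.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: these types are hypothetical and are not Iceberg classes.
final class IncrementalExpireSketch {

  // A snapshot, reduced to its commit time and the files it logically deleted.
  static final class Snapshot {
    final long id;
    final long timestampMs;
    final Set<String> deletedFiles;

    Snapshot(long id, long timestampMs, Set<String> deletedFiles) {
      this.id = id;
      this.timestampMs = timestampMs;
      this.deletedFiles = deletedFiles;
    }
  }

  // Linear history: once a snapshot ages off, the files it deleted are no
  // longer referenced by the table and can be removed from the file system.
  static Set<String> filesToRemove(List<Snapshot> history, long expireOlderThanMs) {
    Set<String> removable = new TreeSet<>();
    for (Snapshot snapshot : history) {
      if (snapshot.timestampMs < expireOlderThanMs) {
        removable.addAll(snapshot.deletedFiles);
      }
    }
    return removable;
  }

  public static void main(String[] args) {
    List<Snapshot> history = List.of(
        new Snapshot(1, 1_000L, Set.of()),
        new Snapshot(2, 2_000L, Set.of("a.parquet")),
        new Snapshot(3, 3_000L, Set.of("b.parquet")));
    // Snapshots 1 and 2 age off; snapshot 3 is kept, so b.parquet stays.
    System.out.println(filesToRemove(history, 2_500L)); // prints [a.parquet]
  }
}
```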

One case where that doesn't work very well is when you have more complex 
changes to table versions/snapshots than a linear history. For example, you 
might stage commits, validate the data, and then cherry-pick the commit to be 
the current version. So there are commits that may not be in the current table 
state. That can also happen when you roll back the table and then commit again. 
In those cases, we don't delete the files deleted when the snapshot expires 
because those changes didn't happen in the current table state. Instead, we 
delete the files added by the commit because those aren't in the current state. 
You can see that this logic gets complicated quickly. And the cases where we 
cherry-pick a commit to master make it even more complex because you have 
multiple copies (that reference one another).
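
A rough sketch of that branching, again as a toy model in plain Java rather than Iceberg's actual code. The `Snapshot` class and its `ancestorOfCurrent` and `cherryPicked` flags are hypothetical, and keeping everything for a cherry-picked commit is a conservative assumption made for this sketch only.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model only: hypothetical types, not Iceberg classes.
final class NonLinearExpireSketch {

  static final class Snapshot {
    final Set<String> addedFiles;
    final Set<String> deletedFiles;
    final boolean ancestorOfCurrent; // false for staged or rolled-back commits
    final boolean cherryPicked;      // true if a copy was picked into the current history

    Snapshot(Set<String> addedFiles, Set<String> deletedFiles,
             boolean ancestorOfCurrent, boolean cherryPicked) {
      this.addedFiles = addedFiles;
      this.deletedFiles = deletedFiles;
      this.ancestorOfCurrent = ancestorOfCurrent;
      this.cherryPicked = cherryPicked;
    }
  }

  // Decide which physical files may be removed when one snapshot expires.
  static Set<String> filesToRemove(Snapshot expired) {
    if (expired.ancestorOfCurrent) {
      // Its deletes took effect in the current state, so those files are unreferenced.
      return new TreeSet<>(expired.deletedFiles);
    }
    if (expired.cherryPicked) {
      // Assumption for this sketch: the picked copy may still reference the
      // added files, so remove nothing here.
      return new TreeSet<>();
    }
    // Its changes never reached the current state, so the files it added are unreferenced.
    return new TreeSet<>(expired.addedFiles);
  }

  public static void main(String[] args) {
    Snapshot rolledBack = new Snapshot(Set.of("new.parquet"), Set.of("old.parquet"), false, false);
    System.out.println(filesToRemove(rolledBack)); // prints [new.parquet]
  }
}
```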

The complexity of incremental expiration made us go in a different direction 
with the action. Instead of trying to figure out from history whether physical 
files can be deleted, we switched to using the set of all data and metadata 
files referenced by the table. We diff the set from before expiration against 
the set from after it and remove anything that is no longer referenced. That's 
simpler and handles cases we never could before, like when someone mistakenly 
deletes a file and then commits the same file again through the API. The 
drawback is that this is a lot more work and requires Spark. So we keep both 
implementations, but we recommend using the action-based one.
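
The set-diff idea can be sketched in a few lines of plain Java. This is only an illustration: it assumes the two sets are the files referenced before and after expiration, the `filesToRemove` helper and file names are hypothetical, and the real action computes these sets with Spark rather than in memory.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model only: a hypothetical helper, not the Iceberg action.
final class ReachabilityDiffSketch {

  // Anything referenced before expiration but not afterwards is safe to remove.
  static Set<String> filesToRemove(Set<String> referencedBefore, Set<String> referencedAfter) {
    Set<String> unreferenced = new TreeSet<>(referencedBefore);
    unreferenced.removeAll(referencedAfter);
    return unreferenced;
  }

  public static void main(String[] args) {
    // b.parquet was deleted by an old snapshot and then committed again,
    // so it is still referenced after expiration and must be kept.
    Set<String> before = Set.of("a.parquet", "b.parquet", "c.parquet");
    Set<String> after = Set.of("b.parquet", "c.parquet");
    System.out.println(filesToRemove(before, after)); // prints [a.parquet]
  }
}
```

Because the decision depends only on what is still referenced, a file that was deleted and later re-added stays put without any special-case logic.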

Ryan

On Sat, Jun 26, 2021 at 6:26 AM Casey Lucas <[email protected]> wrote:
Hi Ryan,
We’re using Iceberg but not with Spark. Can you provide any specifics on why 
you say the Java call made outside of Spark “isn’t as good as the action-based 
one”?
Thanks,
Casey






From: Ryan Blue <[email protected]>
Date: Wednesday, June 23, 2021 at 11:48 AM
To: [email protected] <[email protected]>
Subject: [EXT] Re: question about the gc in iceberg

There is also a way to expire snapshots without using Spark, through the 
ExpireSnapshots API:

table.expireSnapshots().expireOlderThan(timestampInMs).commit();

That is what we used in production for a long time, but it isn’t as good as the 
action-based one that compares file trees. I’d recommend using the 
expire_snapshots procedure that Russell pointed to: 
https://iceberg.apache.org/spark-procedures/#expire_snapshots
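
For the non-Spark call, the only input is the cutoff timestamp in milliseconds. Below is a small sketch of computing it with `java.time`; the `cutoffMillis` helper and the seven-day retention are illustrative assumptions, and the Iceberg call itself appears only in a comment because no `table` handle is constructed here.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration; not part of the Iceberg API.
final class ExpireCutoffSketch {

  // Snapshots older than (now - maxAge) are candidates for expiration.
  static long cutoffMillis(Instant now, Duration maxAge) {
    return now.minus(maxAge).toEpochMilli();
  }

  public static void main(String[] args) {
    long timestampInMs = cutoffMillis(Instant.now(), Duration.ofDays(7));
    // With an Iceberg Table handle named `table` (assumed, not created here),
    // the call from the message above would then be:
    //   table.expireSnapshots().expireOlderThan(timestampInMs).commit();
    System.out.println(timestampInMs);
  }
}
```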

On Wed, Jun 23, 2021 at 7:49 AM Russell Spitzer <[email protected]> wrote:
There are "actions" that cover common table maintenance tasks.

You are most likely interested in ExpireSnapshots, RewriteDataFiles, and 
RemoveOrphanFiles; see:

https://iceberg.apache.org/spark-procedures/

On Tue, Jun 22, 2021 at 7:19 PM yong.sunny <[email protected]> wrote:
Hi Iceberg Dev,

Is there any existing mechanism to do GC in Iceberg? Or is there an 
implementation based on Spark?

Thanks and Best regards,
Yong






--
Ryan Blue
Tabular


