Hi,

Thanks for reading through the proposal and for the good feedback. I have
been thinking about the concerns raised:

   - The motivation for the change
   - Too much additional metadata (storage overhead, namenode pressure on
   HDFS)
   - Performance impact for read/writing TableMetadata
   - Some impact to existing Table APIs and maintenance procedures, which
   would have to check for these snapshots

I chatted a bit offline with Yufei to brainstorm, and I wrote a V2 of the
proposal at the same link:
https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit.
I also tried to clarify the motivation in the doc with actual metadata
table queries that would be possible.
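
For example, with expired snapshot references retained, queries along these
lines become possible (a rough sketch, not the exact queries from the doc; it
assumes a new metadata table exposing the retained references, and all names
here are illustrative):

    // How much data was added per commit in a given window, even after the
    // snapshots themselves were expired?  'spark' is an existing
    // org.apache.spark.sql.SparkSession.
    spark.sql(
        "SELECT committed_at, operation, summary['added-records'] AS added " +
        "FROM db.tbl.expired_snapshots " +
        "WHERE committed_at >= TIMESTAMP '2024-01-01'")
      .show();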

This version now simply adds an optional 'expired-snapshots-path' that
contains the metadata of expired Snapshots (see the sketch after the list
below).  I think this should address the above concerns:

   - Minimal storage overhead for just snapshot references (capped).  I no
   longer propose keeping old snapshot manifest-list/manifest files; the
   snapshot reference to the expired snapshot should be a good start.
   - Minimal perf overhead when reading/writing TableMetadata.  The additional
   file is only written by ExpireSnapshots if the feature is enabled, and only
   read on demand (via a metadata table query, for example).
   - No impact to other Table APIs or maintenance procedures (as these
   snapshots don't show up in the regular table.snapshots() list anymore).
   - Only an additive, optional spec change (backwards compatible).
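
To make that concrete, here is a rough sketch of the shape I have in mind (the
property name is from the doc; the path and file contents below are purely
illustrative, not final):

    "expired-snapshots-path": "s3://bucket/db/tbl/metadata/expired-snapshots.json"

    {
      "expired-snapshots": [
        {
          "snapshot-id": 3051729675574597004,
          "timestamp-ms": 1515100955770,
          "operation": "append",
          "summary": {"added-data-files": "4", "added-records": "100"}
        }
      ]
    }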

Of course, again, this feature is possible outside Iceberg, but the
advantage of doing it in Iceberg is that it could be integrated into the
ExpireSnapshots and Metadata Table frameworks.

Curious what people think?

Thanks
Szehon

On Wed, Jul 10, 2024 at 1:44 AM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> > I believe DeleteOrphanFiles may be ok as is, because currently the logic
> walks down the reachable graph and marks those metadata files as
> 'not-orphan', so it should naturally walk these 'expired' snapshots as well.
>
> We need to keep the metadata files, but remove data files if they are not
> removed for whatever reason. Doable, but a logic change.
>
> > You mean purging expired snapshots in the middle of the history, right?
> I think the current mechanism for this is 'tagging' and 'branching'.
>
> I think for most users the compaction commits are technical details which
> they would like to avoid / don't want to see. The real table history is
> only the changes initiated by the user, and it would be good to hide the
> technical/compaction commits.
>
>
> On Wed, Jul 10, 2024, 08:52 himadri pal <meh...@gmail.com> wrote:
>
>> Hi Szehon,
>>
>> This is a good idea considering the use case it intends to solve. I added a
>> few questions and comments in the design doc.
>>
>> IMO, the alternate options considered in the design doc look
>> cleaner to me.
>>
>> I think it might add to the maintenance burden, now that we need to remember
>> to remove these metadata-only snapshots.
>>
>> Also, I wonder whether some of the use cases it intends to address are
>> solvable by metadata alone - i.e., how much data was added in a given time
>> range? Maybe to answer these kinds of questions users would prefer to
>> create KPIs using columns in the dataset.
>>
>>
>> Regards,
>> Himadri Pal
>>
>>
>> On Tue, Jul 9, 2024 at 11:10 PM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> I am not totally convinced of the motivation yet.
>>>
>>> I thought the snapshot retention window is primarily meant for time
>>> travel and troubleshooting table changes that happened recently (like a few
>>> days or weeks).
>>>
>>> Is it valuable enough to keep expired snapshots for as long as months or
>>> years? While metadata files are typically smaller than data files in total
>>> size, they can still be significant considering the default amount of
>>> column stats written today (especially for wide tables with many columns).
>>>
>>> How long are we going to keep the expired snapshot references by
>>> default? If it is months/years, it can have major implications on the query
>>> performance of metadata tables (like snapshots, all_*).
>>>
>>> I assume it will also have some performance impact on table loading as a
>>> lot more expired snapshots are still referenced.
>>>
>>>
>>>
>>>
>>> On Tue, Jul 9, 2024 at 6:36 PM Szehon Ho <szehon.apa...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Peter and Yufei.
>>>>
>>>> Yes, in terms of implementation, I noted in the doc we need to add
>>>> error checks to prevent time-travel / rollback / cherry-pick operations on
>>>> 'expired' snapshots.  I'll make it clearer in the doc which operations we
>>>> need to check against.
>>>>
>>>> I believe DeleteOrphanFiles may be ok as is, because currently the
>>>> logic walks down the reachable graph and marks those metadata files as
>>>> 'not-orphan', so it should naturally walk these 'expired' snapshots as 
>>>> well.
>>>>
>>>> So, I think the main changes in terms of implementation are going to be
>>>> adding error checks in those Table APIs and updating the ExpireSnapshots
>>>> API.
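>>>>
>>>> As a rough illustration of the kind of check I mean (a sketch only, not
>>>> the final implementation; Snapshot#isExpired() stands in for the proposed
>>>> optional flag, and all names are illustrative):
>>>>
>>>>     import org.apache.iceberg.Snapshot;
>>>>     import org.apache.iceberg.Table;
>>>>     import org.apache.iceberg.exceptions.ValidationException;
>>>>
>>>>     // Reject time-travel / rollback / cherry-pick against an expired
>>>>     // snapshot, wherever a snapshot id gets resolved.
>>>>     static void checkNotExpired(Table table, long snapshotId) {
>>>>       Snapshot snapshot = table.snapshot(snapshotId);
>>>>       if (snapshot != null && snapshot.isExpired()) {  // proposed flag
>>>>         throw new ValidationException(
>>>>             "Cannot operate on expired snapshot: %s", snapshotId);
>>>>       }
>>>>     }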
>>>>
>>>> Do we want to consider expiring snapshots in the middle of the history
>>>>> of the table?
>>>>>
>>>> You mean purging expired snapshots in the middle of the history,
>>>> right?  I think the current mechanism for this is 'tagging' and
>>>> 'branching'.  So interestingly, I was thinking it's related to your other
>>>> question: if we don't add error checks for 'tagging' and 'branching' on
>>>> 'expired' snapshots, they could be handled just as other snapshots are
>>>> handled today.  It's one option.  We could also support it subsequently,
>>>> after the first version and if there's some usage of this.
>>>>
>>>> One thing that comes up in this thread and the Google doc is the question
>>>> of the size of preserved metadata.  I had put in the Alternatives section
>>>> that we could potentially make the ExpireSnapshots purge boolean argument
>>>> more nuanced, like PURGE, PRESERVE_REFS (snapshot refs are preserved), and
>>>> PRESERVE_METADATA (snapshot refs and all metadata files are preserved),
>>>> though I am still debating if it's worth it, as users could choose not to
>>>> use this feature.
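>>>>
>>>> For illustration only (names as floated above, nothing final):
>>>>
>>>>     // Sketch of a more nuanced purge argument for ExpireSnapshots.
>>>>     enum SnapshotRetentionMode {
>>>>       PURGE,             // current behavior: refs, metadata files, and data go
>>>>       PRESERVE_REFS,     // keep snapshot refs; drop manifest-lists/manifests
>>>>       PRESERVE_METADATA  // keep snapshot refs and all metadata files
>>>>     }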
>>>>
>>>> Thanks
>>>> Szehon
>>>>
>>>>
>>>>
>>>> On Tue, Jul 9, 2024 at 6:02 PM Yufei Gu <flyrain...@gmail.com> wrote:
>>>>
>>>>> Thank you for the interesting proposal. With a minor specification
>>>>> change, it could indeed enable different retention periods for data files
>>>>> and metadata files. This differentiation is useful for two reasons:
>>>>>
>>>>>    1. More metadata helps us better understand the table history,
>>>>>    providing valuable insights.
>>>>>    2. Users often prioritize data file deletion as it frees up
>>>>>    significant storage space and removes potentially sensitive data.
>>>>>
>>>>> However, adding a boolean property to the specification isn't
>>>>> necessarily a lightweight solution. As Peter mentioned, implementing this
>>>>> change requires modifications in several places. In this context, external
>>>>> systems like LakeChime or a REST catalog implementation could offer
>>>>> effective solutions to manage extended metadata retention periods, without
>>>>> spec changes.
>>>>>
>>>>> I am neutral on this proposal (+0) and look forward to seeing more
>>>>> input from people.
>>>>> Yufei
>>>>>
>>>>>
>>>>> On Mon, Jul 8, 2024 at 10:32 PM Péter Váry <
>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>>> We need to handle expired snapshots in several places differently in
>>>>>> Iceberg core as well.
>>>>>> - We need to add checks to prevent scans from reading these snapshots
>>>>>> and throw a meaningful error.
>>>>>> - We need to add checks to prevent tagging/branching these snapshots
>>>>>> - We need to update DeleteOrphanFiles in Spark/Flink to not consider
>>>>>> files only referenced by the expired snapshots
>>>>>>
>>>>>> Some Flink jobs do frequent commits, and in these cases, the size of
>>>>>> the metadata file becomes a constraining factor too. In this case, we
>>>>>> could just tell users not to use this feature and expire the metadata as
>>>>>> we do now, but I thought it was worth mentioning.
>>>>>>
>>>>>> Do we want to consider expiring snapshots in the middle of the
>>>>>> history of the table?
>>>>>> When we compact the table, then the compaction commits litter the
>>>>>> real history of the table. Consider the following:
>>>>>> - S1 writes some data
>>>>>> - S2 writes some more data
>>>>>> - S3 compacts the previous 2 commits
>>>>>> - S4 writes even more data
>>>>>> From the query engine user's perspective, S3 is a commit which does
>>>>>> nothing, was not initiated by the user, and is most probably one they
>>>>>> don't even want to know of. If one can expire a snapshot from the middle
>>>>>> of the history, that would be nice, so users would see only S1/S2/S4. The
>>>>>> only downside is that reading S2 is less performant than reading S3, but
>>>>>> IMHO this could be acceptable for having only user-driven changes in the
>>>>>> table history.
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 8, 2024, 20:15 Szehon Ho <szehon.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for the comments so far.  I also thought previously that this
>>>>>>> functionality would be in an external system, like LakeChime, or a
>>>>>>> custom catalog extension.  But after doing an initial analysis (please
>>>>>>> double check), I thought it's a small enough change that it would be
>>>>>>> worth putting in the Iceberg spec/APIs for all users:
>>>>>>>
>>>>>>>    - Table Spec, only one optional boolean field (on Snapshot, only
>>>>>>>    set if the functionality is used).
>>>>>>>    - API, only one boolean parameter (on ExpireSnapshots).
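>>>>>>>
>>>>>>> As a rough sketch of the API side (the new method name below is purely
>>>>>>> illustrative, nothing final):
>>>>>>>
>>>>>>>     // 'table' is an org.apache.iceberg.Table; olderThanMillis a timestamp.
>>>>>>>     table.expireSnapshots()
>>>>>>>         .expireOlderThan(olderThanMillis)
>>>>>>>         .cleanExpiredFiles(true)        // existing option
>>>>>>>         .retainExpiredMetadata(true)    // proposed flag (name illustrative)
>>>>>>>         .commit();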
>>>>>>>
>>>>>>> I do wonder, will keeping expired snapshots as is slow down
>>>>>>>> manifest/scan planning though (REST catalog approaches could probably
>>>>>>>> mitigate this)?
>>>>>>>>
>>>>>>>
>>>>>>> I think it should not slow down manifest/scan planning, because we
>>>>>>> plan using the current snapshot (or the one we specify via time travel),
>>>>>>> and we wouldn't read expired snapshots in this case.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Szehon
>>>>>>>
>>>>>>> On Mon, Jul 8, 2024 at 10:54 AM John Greene <jgreene1...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I do agree with the need that this proposal solves, to decouple the
>>>>>>>> snapshot history from the data deletion. I do wonder, will keeping
>>>>>>>> expired snapshots as is slow down manifest/scan planning though (REST
>>>>>>>> catalog approaches could probably mitigate this)?
>>>>>>>>
>>>>>>>> On Mon, Jul 8, 2024, 5:34 AM Piotr Findeisen <
>>>>>>>> piotr.findei...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Szehon, Walaa
>>>>>>>>>
>>>>>>>>> Thank you Szehon for bringing this up. And thank you Walaa for providing
>>>>>>>>> more context from a similar existing solution to the problem.
>>>>>>>>> The choices that LakeChime seems to have made -- to keep information
>>>>>>>>> in a separate RDBMS, and which particular metadata information to
>>>>>>>>> retain -- indeed look use-case specific, until we observe repeating
>>>>>>>>> patterns.
>>>>>>>>> The idea to bake lifecycle changes into the table format spec was
>>>>>>>>> proposed as an alternative to the idea to bake lifecycle changes into
>>>>>>>>> the REST catalog spec. It was brought into the discussion based on the
>>>>>>>>> intuition that the REST catalog is a first-class citizen in the Iceberg
>>>>>>>>> world, just like other catalogs, and so solutions to table-centric
>>>>>>>>> problems do not need to be limited to the REST catalog. What information
>>>>>>>>> we retain, and how/whether this is configurable, are open questions
>>>>>>>>> applicable to both avenues.
>>>>>>>>>
>>>>>>>>> As a third alternative, we could focus on REST catalog *extensions*,
>>>>>>>>> without naming snapshot metadata lifecycle, and leave the problem up to
>>>>>>>>> REST implementors. That would mean the Iceberg project doesn't address
>>>>>>>>> the snapshot metadata lifecycle topic directly, but instead gives users
>>>>>>>>> tools to build solutions around it. At this point I am not trying to
>>>>>>>>> judge whether it's a good idea or not. It probably depends on how
>>>>>>>>> important it is to solve the problem and have a common solution.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Piotr
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, 6 Jul 2024 at 09:46, Walaa Eldin Moustafa <
>>>>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Szehon,
>>>>>>>>>>
>>>>>>>>>> Thanks for sharing this proposal. We have thought along the same
>>>>>>>>>> lines and implemented an external system (LakeChime [1]) that retains
>>>>>>>>>> snapshot + partition metadata for longer (the actual internal
>>>>>>>>>> implementation keeps data for 13 months, but that can be tuned). For
>>>>>>>>>> efficient analysis, we have kept this data in an RDBMS. My opinion is
>>>>>>>>>> this may be a better fit for an external system (similar to LakeChime)
>>>>>>>>>> since it could potentially complicate the Iceberg spec, APIs, or their
>>>>>>>>>> implementations. Also, the type of metadata tracked can differ
>>>>>>>>>> depending on the use case. For example, while LakeChime retains
>>>>>>>>>> partition and operation type metadata, it does not track file-level
>>>>>>>>>> metadata as there was no specific use case for that.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://www.linkedin.com/blog/engineering/data-management/lakechime-a-data-trigger-service-for-modern-data-lakes
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Walaa.
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 5, 2024 at 11:49 PM Szehon Ho <
>>>>>>>>>> szehon.apa...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi folks,
>>>>>>>>>>>
>>>>>>>>>>> I would like to discuss an idea for an optional extension of
>>>>>>>>>>> Iceberg's Snapshot metadata lifecycle.  Thanks Piotr for replying on
>>>>>>>>>>> the other thread that this should be a fuller Iceberg format change.
>>>>>>>>>>>
>>>>>>>>>>> *Proposal Summary*
>>>>>>>>>>>
>>>>>>>>>>> Currently, ExpireSnapshots(long olderThan) purges metadata and
>>>>>>>>>>> deleted data of a Snapshot together.  Purging deleted data often
>>>>>>>>>>> requires a shorter timeline, due to strict requirements to claw back
>>>>>>>>>>> unused disk space, fulfill data lifecycle compliance, etc.  In many
>>>>>>>>>>> deployments, this means the 'olderThan' timestamp is set to just a
>>>>>>>>>>> few days before the current time (the default is 5 days).
>>>>>>>>>>>
>>>>>>>>>>> On the other hand, purging metadata could ideally be done on a more
>>>>>>>>>>> relaxed timeline, such as months or more, to allow for meaningful
>>>>>>>>>>> historical table analysis.
>>>>>>>>>>>
>>>>>>>>>>> We should have an optional way to purge Snapshot metadata separately
>>>>>>>>>>> from purging deleted data.  This would allow us to get the history of
>>>>>>>>>>> the table, and answer questions like:
>>>>>>>>>>>
>>>>>>>>>>>    - When was a file/partition added
>>>>>>>>>>>    - When was a file/partition deleted
>>>>>>>>>>>    - How much data was added or removed in time X
>>>>>>>>>>>
>>>>>>>>>>> that are currently only possible for data operations within a
>>>>>>>>>>> few days.
>>>>>>>>>>>
>>>>>>>>>>> *Github Proposal*:
>>>>>>>>>>> https://github.com/apache/iceberg/issues/10646
>>>>>>>>>>> *Google Design Doc*:
>>>>>>>>>>> https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit
>>>>>>>>>>>
>>>>>>>>>>> Curious if anyone has thought along these lines and/or sees
>>>>>>>>>>> obvious issues.  Would appreciate any feedback on the proposal.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Szehon
>>>>>>>>>>>
>>>>>>>>>>
