szehon-ho opened a new issue, #10646:
URL: https://github.com/apache/iceberg/issues/10646
### Proposed Change
**Motivation**
Currently, a snapshot's lifecycle is handled by 'ExpireSnapshots(long
olderThan)'. This operation does the following:
- Choose a set of snapshots to expire based on timestamp
- Remove association of these Snapshots from TableMetadata
- Purge metadata of these Snapshots
- Purge data files of these Snapshots that are not referred to by
non-expired snapshots (i.e., data files that were deleted from the table
before the olderThan timestamp).
Purging deleted data often requires a more aggressive timeline, due to
strict requirements to claw back unused disk space, fulfill data lifecycle
compliance, etc. In many deployments, this means 'olderThan' timestamp is set
to just a few days before the current time (the default is 5 days).
On the other hand, metadata is ideally purged on a more relaxed timeline,
to allow for meaningful historical table analysis. That timeline could be
months or even years.
But today, the two are purged together and we cannot preserve just the
Snapshot metadata, if we choose an aggressive olderThan timestamp for the
purpose of purging deleted Snapshot data.
**Implementation Summary**
Add an additional field to snapshot metadata:
v3 | Field | Description
-- | -- | --
optional | expired | Whether this snapshot has been expired but not purged. Defaults to false
In the reference implementation, improve ExpireSnapshots (Core, Spark) to
take another parameter:
```
/**
 * Proposed: whether to purge Snapshot metadata on expiry. Defaults to true
 * (today's behavior); pass false to keep the metadata and mark it expired.
 */
ExpireSnapshots purge(boolean purge);
```
ExpireSnapshots will continue to purge deleted data files for the Snapshots
chosen for expiration as it does today. But now, if purge == false, the
Snapshot metadata is maintained, and TableMetadata maintains the Snapshot
reference (with the expired flag set to true on the Snapshots).
These 'expired' but un-purged Snapshots can then be disassociated from
TableMetadata later by another call to ExpireSnapshots with purge == true,
which will also purge their metadata.
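
To make the two-phase lifecycle concrete, here is a minimal in-memory sketch. It is not Iceberg code: `TwoPhaseExpiry`, `Snap`, `expire` and `expiredIds` are hypothetical names, data-file cleanup is left out, and it assumes that a later purge == true call selects snapshots by the same olderThan timestamp rule.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the proposed two-phase expiry; all names are hypothetical. */
class TwoPhaseExpiry {

  /** Stand-in for a snapshot entry in TableMetadata. */
  static final class Snap {
    final long id;
    final long timestampMillis;
    boolean expired; // the proposed optional 'expired' field, false by default

    Snap(long id, long timestampMillis) {
      this.id = id;
      this.timestampMillis = timestampMillis;
    }
  }

  final List<Snap> snapshots = new ArrayList<>();

  /**
   * Expires snapshots older than the cutoff.
   * purge == true: drop them from the list (today's behavior).
   * purge == false: keep them listed and set expired = true.
   */
  void expire(long olderThanMillis, boolean purge) {
    if (purge) {
      snapshots.removeIf(s -> s.timestampMillis < olderThanMillis);
    } else {
      for (Snap s : snapshots) {
        if (s.timestampMillis < olderThanMillis) {
          s.expired = true;
        }
      }
    }
  }

  /** Ids of snapshots that are still listed but flagged as expired. */
  List<Long> expiredIds() {
    List<Long> ids = new ArrayList<>();
    for (Snap s : snapshots) {
      if (s.expired) {
        ids.add(s.id);
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    TwoPhaseExpiry table = new TwoPhaseExpiry();
    table.snapshots.add(new Snap(1L, 1_000L));
    table.snapshots.add(new Snap(2L, 2_000L));
    table.snapshots.add(new Snap(3L, 3_000L));

    table.expire(2_500L, false); // aggressive cutoff, metadata kept
    System.out.println(table.snapshots.size() + " listed, expired: " + table.expiredIds());

    table.expire(1_500L, true); // relaxed cutoff, metadata purged
    System.out.println(table.snapshots.size() + " listed, expired: " + table.expiredIds());
  }
}
```

The point of the split is that the two calls can use different cutoffs: a short one for reclaiming data files and a much longer one for dropping metadata.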
Expired but un-purged Snapshots behave as if effectively removed, and cannot
be the target of rollback or time-travel operations. This is because the data
files they refer to may have been purged by the ExpireSnapshots operation. They
will, however, show up in the TableMetadata's list of snapshots, marked by the
'expired' flag. Their metadata can also show up in the 'manifests' and 'files'
metadata tables, also marked with an 'expired' flag.
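
The read-side rule could look roughly like the sketch below: expired snapshots remain visible in listings but are rejected as rollback or time-travel targets. This is only an illustration; `ExpiredSnapshotGuard`, `resolveForRead` and `describe` are hypothetical names, and the exception type and messages are assumptions rather than anything the proposal specifies.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of how readers might treat expired snapshots; names are hypothetical. */
class ExpiredSnapshotGuard {

  // snapshot id -> value of the proposed 'expired' flag
  final Map<Long, Boolean> expiredById = new LinkedHashMap<>();

  /** Returns the id if it may be used for rollback or time travel, otherwise throws. */
  long resolveForRead(long snapshotId) {
    Boolean expired = expiredById.get(snapshotId);
    if (expired == null) {
      throw new IllegalArgumentException("Cannot find snapshot " + snapshotId);
    }
    if (expired) {
      // The data files may already be gone, so treat it like a removed snapshot.
      throw new IllegalArgumentException("Snapshot " + snapshotId + " is expired");
    }
    return snapshotId;
  }

  /** Listing view: every snapshot is shown, with its expired flag. */
  List<String> describe() {
    List<String> rows = new ArrayList<>();
    for (Map.Entry<Long, Boolean> e : expiredById.entrySet()) {
      rows.add("snapshot_id=" + e.getKey() + " expired=" + e.getValue());
    }
    return rows;
  }

  public static void main(String[] args) {
    ExpiredSnapshotGuard guard = new ExpiredSnapshotGuard();
    guard.expiredById.put(1L, true);
    guard.expiredById.put(2L, false);
    guard.describe().forEach(System.out::println);
    System.out.println("readable: " + guard.resolveForRead(2L));
  }
}
```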
### Proposal document
https://docs.google.com/document/d/1m5K_XT7bckGfp8VrTe2093wEmEMslcTUE3kU_ohDn6A/edit
### Specifications
- [X] Table
- [ ] View
- [ ] REST
- [ ] Puffin
- [ ] Encryption
- [ ] Other
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]