Thanks, Jack, for the question and for your efforts learning these concepts. There isn’t a safety issue here. There are two services at play: Archival and Cleaning. In Hudi, the Archival process moves older commit information in the timeline to the archived directory. The actual data itself is not cleaned up by the archival process; that is done by the Cleaner process based on the cleaner settings. The cleaner only removes older versions of a file group. If there is only one file slice (version) in a file group (e.g., no updates since the data was first written), it will remain untouched by the cleaner.
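To make that concrete, here is a rough sketch (not a complete recipe; the table name, field names, path, and values are illustrative, and defaults vary by Hudi version) showing that the cleaner and archival retention are tuned with separate settings, e.g. in a PySpark writer:

    # Illustrative PySpark write showing that cleaning and archival are
    # configured independently. Values are examples only.
    hudi_options = {
        "hoodie.table.name": "my_table",                   # hypothetical table name
        "hoodie.datasource.write.recordkey.field": "id",   # hypothetical key field
        "hoodie.datasource.write.precombine.field": "ts",  # hypothetical ordering field
        # Cleaner: how many commits' worth of older file slices (data
        # versions) to retain per file group
        "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
        "hoodie.cleaner.commits.retained": "10",
        # Archival: how many commits' metadata to keep in the active
        # timeline; typically kept larger than the cleaner retention
        "hoodie.keep.min.commits": "20",
        "hoodie.keep.max.commits": "30",
    }

    # df is assumed to be an existing Spark DataFrame of new records
    (df.write.format("hudi")
       .options(**hudi_options)
       .mode("append")
       .save("/tmp/hudi/my_table"))  # hypothetical base path

The point of the sketch is that the cleaner settings control how many data versions survive, while the keep.min/max.commits settings only control how much commit metadata stays in the active timeline.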
For snapshot queries, all query engines can identify the latest file slice in every file group and read from it. Even if the older commit metadata for a file group has been archived, the file group itself remains accessible. However, time travel and incremental queries rely on commit metadata to track changes over time. Archiving older commit metadata limits how far back these queries can go, restricting them to the oldest commit in the active timeline. This also has implications for rollbacks and restores: when commit metadata is archived from the timeline, all of its side effects are removed from storage. In other words, that "arbitrary" number is simply how far back we keep metadata history in the active timeline. The latest committed data for all file groups is always available for querying.

Thanks,
Sudha

On Tue, Aug 6, 2024 at 9:24 AM Jack Vanlightly <vanligh...@apache.org> wrote:

> Hi all,
>
> In April I wrote a formal specification for COW tables (
> https://github.com/Vanlightly/table-formats-tlaplus/tree/main/hudi/v5_spec/basic_cow
> )
> and since then I was looking at possibly going back and adding MOR as well
> as archival and compaction.
>
> I've read the code, read the docs, and there's something that I can't
> figure out about timeline archival - how does Hudi prevent the archive
> process from archiving "live" instants? If, for example, I have a primary
> key table with 2 file groups, and "min commits to keep" is 20 but the last
> 20 commits are all related to file group 2, then the commits of file group
> 1 would be archived, making file group 1 unreadable.
>
> Delta Lake handles log cleaning via checkpointing. Once a checkpoint has
> been inserted into the Delta Log, prior entries can be removed. But with
> Hudi, it seems you choose an arbitrary number of commits to keep, and so I
> am left wondering how it can be safe?
>
> I am sure I have missed something, thanks in advance.
>
> Jack Vanlightly