Thanks, Jack, for the question and for your efforts in learning these concepts.
There isn’t a safety issue here. There are two services at play: Archival
and Cleaning. In Hudi, the Archival process moves older commit information
in the timeline to the archived directory. The actual data itself is not
cleaned up by the archival process; this is done by the cleaner process
based on the cleaner settings. The cleaner only removes older versions of a
file group. If there is only one file slice (version) in a file group (e.g.,
no updates since the data was first written), it will remain untouched by
the cleaner.
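
To make the two services concrete, here is a minimal PySpark sketch. It
assumes an existing DataFrame df, a SparkSession, and a table path base_path;
the table name, key/precombine fields, and retention numbers are only
illustrative, not recommendations:

    hudi_options = {
        "hoodie.table.name": "my_table",                    # illustrative
        "hoodie.datasource.write.recordkey.field": "id",    # illustrative
        "hoodie.datasource.write.precombine.field": "ts",   # illustrative

        # Cleaner: reclaims older file slices, keeping whatever is needed to
        # serve the latest N commits. A file group with a single slice is
        # never touched by the cleaner.
        "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
        "hoodie.cleaner.commits.retained": "10",

        # Archival: trims commit metadata from the active timeline once it
        # grows beyond max, keeping at least min instants. It does not delete
        # data files.
        "hoodie.keep.min.commits": "20",
        "hoodie.keep.max.commits": "30",
    }

    df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)

Note that the cleaner retention (10 commits here) is kept smaller than the
archival minimum (20 commits here), so data files are never referenced only
by archived instants.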

For snapshot queries, all query engines can identify the latest file slice
in every file group and read from that. Even if the older commit metadata
for a file group is archived, the file group itself remains accessible.
However, for time travel and incremental queries, the commit metadata is
necessary to track changes over time. Archiving older commit info limits
how far back you can go with these types of queries, restricting them to the
oldest commit in the active timeline. This also has implications for
rollbacks and restores: when commit metadata is archived from the timeline,
all of its side effects are removed from storage. In other words, that
"arbitrary" number is simply how far back we keep the history of metadata in
the timeline. The
latest committed data for all file groups is always available for querying.
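
As a rough illustration of the difference (PySpark datasource reads against a
hypothetical table at base_path; the instant times below are made up):

    # Snapshot query: always resolves the latest file slice of every file
    # group, regardless of how much of the timeline has been archived.
    snapshot_df = spark.read.format("hudi").load(base_path)

    # Time travel: needs the commit metadata for the target instant, so it
    # only works back to the oldest commit still in the active timeline.
    time_travel_df = (spark.read.format("hudi")
                      .option("as.of.instant", "20240801000000")
                      .load(base_path))

    # Incremental query: pulls records changed after a given instant, which
    # likewise must still be resolvable from the active timeline.
    incremental_df = (spark.read.format("hudi")
                      .option("hoodie.datasource.query.type", "incremental")
                      .option("hoodie.datasource.read.begin.instanttime", "20240801000000")
                      .load(base_path))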


Thanks,

Sudha

On Tue, Aug 6, 2024 at 9:24 AM Jack Vanlightly <vanligh...@apache.org>
wrote:

> Hi all,
>
> In April I wrote a formal specification for COW tables (
>
> https://github.com/Vanlightly/table-formats-tlaplus/tree/main/hudi/v5_spec/basic_cow
> )
> and since then I have been looking at possibly going back and adding MOR as well
> as archival and compaction.
>
> I've read the code, read the docs and there's something that I can't figure
> out about timeline archival - how does Hudi prevent the archive process
> from archiving "live" instants? If for example, I have a primary key table
> with 2 file groups, and "min commits to keep" is 20 but the last 20 commits
> are all related to file group 2, then the commits of file group 1 would be
> archived, making file group 1 unreadable.
>
> Delta Lake handles log cleaning via checkpointing. Once a checkpoint has
> been inserted into the Delta Log, prior entries can be removed. But with
> Hudi, it seems you choose an arbitrary number of commits to keep, and so I
> am left wondering how it can be safe?
>
> I am sure I have missed something, thanks in advance.
>
> Jack Vanlightly
>
