Hi all,

In April I wrote a formal specification for COW tables (
https://github.com/Vanlightly/table-formats-tlaplus/tree/main/hudi/v5_spec/basic_cow)
and since then I've been looking at going back to add MOR, as well as
archival and compaction.

I've read the code and the docs, and there's one thing I can't figure
out about timeline archival: how does Hudi prevent the archival process
from archiving "live" instants? For example, if I have a primary key table
with 2 file groups, and "min commits to keep" is 20, but the last 20
commits all relate to file group 2, then the commits of file group 1 would
be archived, making file group 1 unreadable.
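
To make that concrete, here's a minimal Python sketch of the scenario. The
commit list and the keep-last-N rule are my own simplifications for
illustration, not Hudi's actual archival code:

    # Hypothetical timeline: 2 commits on file group 1, then 20 on file group 2.
    commits = [("c01", "fg1"), ("c02", "fg1")] + \
              [(f"c{i:02d}", "fg2") for i in range(3, 23)]

    # Simplified stand-in for "min commits to keep": retain only the last 20.
    min_commits_to_keep = 20
    archived = commits[:-min_commits_to_keep]  # the two fg1 commits
    live = commits[-min_commits_to_keep:]      # all fg2 commits

    # File group 1 is left with no live commits on the active timeline.
    fg1_live = [c for c, fg in live if fg == "fg1"]
    print(fg1_live)  # [] -> how would a reader reconstruct file group 1?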

Delta Lake handles log cleaning via checkpointing. Once a checkpoint has
been inserted into the Delta Log, prior entries can be removed. But with
Hudi, it seems you choose an arbitrary number of commits to keep, and so I
am left wondering how that can be safe.
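
For comparison, here's a sketch of the checkpoint-based rule as I
understand it (simplified, with hypothetical version numbers): safety falls
out of the checkpoint rather than a commit count.

    # A checkpoint at version N captures the full table state, so log
    # entries with version < N are removable without losing anything.
    log_versions = list(range(0, 30))  # hypothetical Delta log versions
    checkpoint_version = 20            # checkpoint summarises versions 0..20

    removable = [v for v in log_versions if v < checkpoint_version]
    retained = [v for v in log_versions if v >= checkpoint_version]
    # Readers start from the checkpoint, so dropping 'removable' is safe.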

I am sure I have missed something. Thanks in advance.

Jack Vanlightly
