Thanks Ryan. This makes sense. To reiterate what you said:

* Compaction can be done with the RewriteFiles API. It creates a new table snapshot with a new set of files, leaving the data content unchanged.
* Garbage collection can be done with the ExpireSnapshots API. Data files are deleted during the API call as long as their reference count is 0, i.e. no remaining snapshot references them.
* Neither process is automatic; both need to be executed explicitly. For example, there is no automatic expiration of snapshots based on time.
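To make the three points above concrete, here is a toy in-memory sketch of the snapshot model as described in this thread. It is not Iceberg code and does not use the Iceberg API: `ToyTable`, `rewrite_files`, `expire_snapshots`, and the `keep_last` parameter are hypothetical names chosen for illustration only, and the real RewriteFiles and ExpireSnapshots APIs have different signatures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Immutable: a complete set of the data files that make up the table."""
    snapshot_id: int
    operation: str      # "append" or "replace"
    files: frozenset    # names of the data files in this snapshot


class ToyTable:
    """Hypothetical in-memory model of the behavior discussed; not Iceberg."""

    def __init__(self):
        self.snapshots = []   # oldest first; the last entry is the current state
        self.storage = {}     # file name -> list of rows
        self._next_id = 1

    def _commit(self, operation, files):
        self.snapshots.append(Snapshot(self._next_id, operation, frozenset(files)))
        self._next_id += 1

    def append(self, name, rows):
        base = self.snapshots[-1].files if self.snapshots else frozenset()
        self.storage[name] = list(rows)
        self._commit("append", base | {name})

    def rewrite_files(self, to_delete, new_name):
        """Compaction: swap small files for one larger file in a NEW snapshot."""
        current = self.snapshots[-1].files
        merged = [row for f in sorted(to_delete) for row in self.storage[f]]
        self.storage[new_name] = merged
        # Old files stay in storage: older snapshots still reference them.
        self._commit("replace", (current - set(to_delete)) | {new_name})

    def expire_snapshots(self, keep_last=1):
        """Explicit cleanup: drop old snapshots, delete unreferenced files."""
        if keep_last < 1:
            raise ValueError("keep_last must be at least 1")
        self.snapshots = self.snapshots[-keep_last:]
        live = set().union(*(s.files for s in self.snapshots))
        deleted = sorted(f for f in self.storage if f not in live)
        for f in deleted:
            del self.storage[f]
        return deleted

    def read(self, snapshot=None):
        snap = snapshot or self.snapshots[-1]
        return sorted(row for f in snap.files for row in self.storage[f])


table = ToyTable()
table.append("small-1", [1, 2])
table.append("small-2", [3])
before = table.read()

table.rewrite_files({"small-1", "small-2"}, "large-1")
after = table.read()
old_snapshot = table.snapshots[1]             # state before compaction
time_travel = table.read(old_snapshot)        # still readable until expired
files_before_expire = sorted(table.storage)

deleted = table.expire_snapshots(keep_last=1)
files_after_expire = sorted(table.storage)
```

In this sketch the small files survive the rewrite and are only removed by the explicit `expire_snapshots` call, which mirrors the point that nothing is cleaned up automatically.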

Thanks,
Chen

On Fri, Jun 26, 2020 at 3:13 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi Chen,
>
> The "replace" operation indicates that although the files in a table
> changed, the actual table data did not. Queries should produce the same
> results, if they are deterministic. That's why we use it for file
> compaction: although we replace small files with fewer, larger files, the
> overall contents don't change. To your question, yes, that's what the
> RewriteFiles API does.
>
> All operations change the current table state. Each snapshot of a table is
> a complete set of the data files that make up the table, and snapshots are
> immutable. So you can't go back and change a snapshot from yesterday. What
> you can do is replace small files in the current state with a compacted
> large file. That creates a new snapshot that is used from then on. The
> small files are still referenced and available as long as the old snapshot
> exists, which is why snapshots should be cleaned up regularly with
> ExpireSnapshots. That will delete files that are no longer referenced.
>
> We keep files referenced by old snapshots around for a couple of reasons.
> First, readers that started with a different current snapshot may still be
> reading them. Second, it allows you to go back and read the table at an
> older point in time -- time-travel queries.
>
> I hope that helps,
>
> rb
>
> On Fri, Jun 26, 2020 at 10:39 AM Chen Song <chen.song...@gmail.com> wrote:
>
>> Hey
>>
>> The Iceberg documentation mentions using this for compaction
>> <https://iceberg.apache.org/spec/#snapshots>. I have a few questions on
>> compaction.
>>
>> Is this (replace) referring to this RewriteFiles API
>> <https://iceberg.apache.org/javadoc/master/org/apache/iceberg/RewriteFiles.html>
>> ?
>> If so, it looks like it only applies to the most recent snapshot of data?
>> Is there a way to compact data belonging to old snapshots? e.g., if I want
>> to rewrite older data with a newer partition spec?
>>
>> Thanks for the help in advance.
>>
>> --
>> Chen Song
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
Chen Song