Thanks Ryan. This makes sense.

To reiterate what you said:
* Compaction can be done with the RewriteFiles API, which creates a new
table snapshot with a new set of files and the data content unchanged.
* Garbage collection can be done with the ExpireSnapshots API; data files
are deleted during the API call once no remaining snapshot references them.
* Neither process is automatic; both need to be executed explicitly,
e.g., there is no automatic expiration of snapshots based on time.
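
To check my understanding, here is a minimal toy model of the three points above. It is only a sketch and does not use the Iceberg library: ToyTable, its rewriteFiles and expireSnapshots methods, and the "keep the newest N snapshots" policy are all hypothetical names and simplifications I made up for illustration, and the real RewriteFiles and ExpireSnapshots APIs look different.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: this is NOT the Iceberg API, just an illustration of the idea.
class ToyTable {
    // Each snapshot is an immutable, complete set of data file names.
    private final List<Set<String>> snapshots = new ArrayList<>();
    // Files that physically exist in (pretend) storage.
    private final Set<String> storage = new HashSet<>();

    ToyTable(Set<String> initialFiles) {
        storage.addAll(initialFiles);
        snapshots.add(Collections.unmodifiableSet(new HashSet<>(initialFiles)));
    }

    Set<String> currentSnapshot() {
        return snapshots.get(snapshots.size() - 1);
    }

    int snapshotCount() {
        return snapshots.size();
    }

    Set<String> filesInStorage() {
        return Collections.unmodifiableSet(new HashSet<>(storage));
    }

    // Compaction: replace files in the current state. Old snapshots are untouched;
    // a NEW snapshot is appended, and the replaced files stay in storage.
    void rewriteFiles(Set<String> toDelete, Set<String> toAdd) {
        Set<String> next = new HashSet<>(currentSnapshot());
        if (!next.containsAll(toDelete)) {
            throw new IllegalArgumentException("can only rewrite files in the current snapshot");
        }
        next.removeAll(toDelete);
        next.addAll(toAdd);
        storage.addAll(toAdd);
        snapshots.add(Collections.unmodifiableSet(next));
    }

    // Expiration: must be called explicitly. Drops all but the newest `keep`
    // snapshots, then deletes files that no remaining snapshot references.
    Set<String> expireSnapshots(int keep) {
        if (keep < 1) {
            throw new IllegalArgumentException("must keep at least the current snapshot");
        }
        while (snapshots.size() > keep) {
            snapshots.remove(0);
        }
        Set<String> referenced = new HashSet<>();
        for (Set<String> snapshot : snapshots) {
            referenced.addAll(snapshot);
        }
        Set<String> deleted = new HashSet<>(storage);
        deleted.removeAll(referenced);
        storage.removeAll(deleted);
        return deleted;
    }
}
```

In this model, rewriting files "a" and "b" into "ab" leaves "a" and "b" in storage until expireSnapshots is called, which matches my reading that old files stay available as long as an old snapshot still references them.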

Thanks,
Chen

On Fri, Jun 26, 2020 at 3:13 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Hi Chen,
>
> The "replace" operation indicates that although the files in a table
> changed, the actual table data did not. Queries should produce the same
> results, if they are deterministic. That's why we use it for file
> compaction: although we replace small files with fewer, larger files, the
> overall content doesn't change. To your question, yes, that's what the
> RewriteFiles API does.
>
> All operations change the current table state. Each snapshot of a table is
> a complete set of the data files that make up the table, and snapshots are
> immutable. So you can't go back and change a snapshot from yesterday. What
> you can do is replace small files in the current state with a compacted
> large file. That creates a new snapshot that is used from then on. The
> small files are still referenced and available as long as the old snapshot
> exists, which is why snapshots should be cleaned up regularly with
> ExpireSnapshots. That will delete files that are no longer referenced.
>
> We keep files referenced by old snapshots around for a couple of reasons. First,
> readers that started with a different current snapshot may still be reading
> them. Second, it allows you to go back and read the table at an older point
> in time -- time-travel queries.
>
> I hope that helps,
>
> rb
>
> On Fri, Jun 26, 2020 at 10:39 AM Chen Song <chen.song...@gmail.com> wrote:
>
>> Hey
>>
>> The Iceberg documentation mentions using this for compaction
>> <https://iceberg.apache.org/spec/#snapshots>. I have a few questions on
>> compaction.
>>
>> Is this (replace) referring to this RewriteFiles API
>> <https://iceberg.apache.org/javadoc/master/org/apache/iceberg/RewriteFiles.html>
>> ?
>> If so, it looks like it only applies to the most recent snapshot of data?
>> Is there a way to compact data belonging to old snapshots? e.g., if I want
>> to rewrite older data with a newer partition spec?
>>
>> Thanks for the help in advance.
>>
>> --
>> Chen Song
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Chen Song
