Fokko commented on issue #329:
URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2058636547

   @marvinlanhenke Sorry for being late to the party here. Appending a file is rather straightforward, but a number of conditions must be met along the way. This is the high-level way of appending a file:
   
   - Write a Parquet file with the Iceberg field IDs populated (a minimal Rust sketch of this follows the list).
   - Collect the metrics to populate the statistics in the manifest file. We do 
this in PyIceberg 
[here](https://github.com/apache/iceberg-python/blob/49ac3a27794fc12cfb67b29502ba92b429396201/pyiceberg/io/pyarrow.py#L1433-L1496).
   - Write the snapshot following the concept of a fast-append. A normal append 
will append the new files to an existing manifest, and a fast-append will write 
a new manifest file with the new entries. This is much easier to implement, 
since you don't have to worry about [sequence-number inheritance and 
such](https://iceberg.apache.org/spec/#sequence-number-inheritance).
   - Rewrite the manifest-list to add the newly created manifest.
   - Generate a snapshot summary.
   - Update the metadata. When you are using a traditional catalog like Glue or Hive, this can be a bit of work. If you use the Iceberg REST catalog, it is much easier, since committing the new metadata is the responsibility of the REST catalog.
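
   As a minimal sketch of the first step (assuming recent versions of the `arrow` and `parquet` crates, not iceberg-rust API): attach the Iceberg field IDs to the Arrow schema via the `PARQUET:field_id` metadata key, and the Arrow-to-Parquet writer copies them into the Parquet schema. The schema, the field IDs (1, 2), and the file name below are made up for illustration.

```rust
// Minimal sketch, not iceberg-rust API: write a Parquet data file whose
// columns carry Iceberg field IDs, using the `arrow` and `parquet` crates.
use std::{collections::HashMap, fs::File, sync::Arc};

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::{ArrowWriter, PARQUET_FIELD_ID_META_KEY};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The "PARQUET:field_id" metadata entry on an Arrow field is what the
    // Arrow-to-Parquet writer copies into the Parquet schema as the field ID.
    let id_field = Field::new("id", DataType::Int64, false).with_metadata(
        HashMap::from([(PARQUET_FIELD_ID_META_KEY.to_string(), "1".to_string())]),
    );
    let name_field = Field::new("name", DataType::Utf8, true).with_metadata(
        HashMap::from([(PARQUET_FIELD_ID_META_KEY.to_string(), "2".to_string())]),
    );
    let schema = Arc::new(Schema::new(vec![id_field, name_field]));

    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int64Array::from(vec![1_i64, 2, 3])) as ArrayRef,
            Arc::new(StringArray::from(vec![Some("a"), Some("b"), None])),
        ],
    )?;

    let file = File::create("00000-0-data.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    // close() flushes the footer and returns the Parquet FileMetaData, which
    // is also where the statistics for the manifest metrics come from.
    let _file_metadata = writer.close()?;
    Ok(())
}
```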
   
   > calling the writer to write the DataFile
   
   > I think this is also what the python implementation does. In 
Transaction.append, it calls _dataframe_to_data_files to generate DataFiles 
based on the pa.Table.
   
   In PyIceberg we have [`_dataframe_to_data_files`](https://github.com/apache/iceberg-python/blob/49ac3a27794fc12cfb67b29502ba92b429396201/pyiceberg/table/__init__.py#L2683), which writes out the Arrow table to one or more Parquet files. Then we collect all the statistics and return a `DataFile` that can be appended to the table. I hope in the future that we can push this down to iceberg-rust :)
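
   To make the statistics-collection part concrete, here is a hedged Rust sketch that continues the Parquet example above: it walks the `FileMetaData` that `ArrowWriter::close()` returns and tallies record counts and per-column null counts (lower/upper bounds would come from the same statistics, but need per-type decoding). The `DataFileMetrics` struct is made up for illustration and is not an iceberg-rust type.

```rust
// Hedged sketch, continuing the example above: derive rough per-file metrics
// from the parquet FileMetaData returned by ArrowWriter::close(). The
// DataFileMetrics struct is made up for illustration, not an iceberg-rust type.
use std::collections::HashMap;

use parquet::format::FileMetaData;

#[derive(Debug, Default)]
struct DataFileMetrics {
    record_count: i64,
    // Null-value counts keyed by column path, e.g. "name".
    null_value_counts: HashMap<String, i64>,
}

// Usage, continuing the sketch above: let metrics = collect_metrics(&writer.close()?);
fn collect_metrics(meta: &FileMetaData) -> DataFileMetrics {
    let mut metrics = DataFileMetrics::default();
    for row_group in &meta.row_groups {
        metrics.record_count += row_group.num_rows;
        for column in &row_group.columns {
            let Some(col_meta) = &column.meta_data else { continue };
            let path = col_meta.path_in_schema.join(".");
            if let Some(stats) = &col_meta.statistics {
                if let Some(nulls) = stats.null_count {
                    *metrics.null_value_counts.entry(path).or_insert(0) += nulls;
                }
            }
        }
    }
    metrics
}
```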
   
   > If any error happens during generating metadata relation info like 
manifest etc., as the writer already wrote DataFiles, should we go to delete 
the written DataFiles?
   
   Iceberg Java does this on a best-effort basis. If the commit fails, it tries to clean up the files it has written, but it is always possible that the cleanup won't happen (looking at you, OOMs). This is where the maintenance tasks kick in, as @sdd already pointed out.
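
   As a rough, hedged sketch of that best-effort pattern (the function and its arguments are made up for illustration, not iceberg-rust API):

```rust
use std::fs;
use std::path::Path;

// Hedged sketch of the best-effort cleanup idea. `commit` stands in for
// writing the manifest / manifest-list and committing the new metadata.
// If it fails, try to remove the data files that were already written,
// ignoring errors during cleanup; a crash or OOM can still leave orphan
// files behind, which is what the maintenance tasks clean up later.
fn commit_or_clean_up<E>(
    written_files: &[&Path],
    commit: impl FnOnce() -> Result<(), E>,
) -> Result<(), E> {
    match commit() {
        Ok(()) => Ok(()),
        Err(err) => {
            for path in written_files {
                let _ = fs::remove_file(path); // best effort, errors ignored
            }
            Err(err)
        }
    }
}
```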
   
   Talking about prioritization: things can happen in parallel. For example, something simpler like updating table properties will make sure that the commit path is in place. The snapshot-summary generation can be its own PR, and the same goes for collecting the column metrics.
   

