Fokko commented on issue #329:
URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2058636547

   @marvinlanhenke Sorry for being late to the party here. Appending a file is rather straightforward, but a number of conditions must be met along the way. This is the high-level way of appending a file:
   
   - Write a Parquet file with the Iceberg field IDs populated (a minimal Rust sketch of this follows the list).
   - Collect the metrics to populate the statistics in the manifest file. We do 
this in PyIceberg 
[here](https://github.com/apache/iceberg-python/blob/49ac3a27794fc12cfb67b29502ba92b429396201/pyiceberg/io/pyarrow.py#L1433-L1496).
   - Write the snapshot following the concept of a fast-append. A normal append 
will append the new files to an existing manifest, and a fast-append will write 
a new manifest file with the new entries. This is much easier to implement, 
since you don't have to worry about [sequence-number inheritance and 
such](https://iceberg.apache.org/spec/#sequence-number-inheritance).
   - Rewrite the manifest-list to add the newly created manifest.
   - Generate a snapshot summary.
   - Update the metadata. When you are using a traditional catalog like Glue or Hive, this can be a bit of work. If you use the Iceberg REST catalog, it is much easier, since committing the new metadata is the responsibility of the REST catalog.
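
   As a minimal sketch of the first step (assuming recent versions of the `arrow` and `parquet` crates, not iceberg-rust API): attach the Iceberg field IDs to the Arrow schema via the `PARQUET:field_id` metadata key, and the Arrow-to-Parquet writer copies them into the Parquet schema. The schema, the field IDs (1, 2), and the file name below are made up for illustration.

```rust
// Minimal sketch, not iceberg-rust API: write a Parquet data file whose
// columns carry Iceberg field IDs, using the `arrow` and `parquet` crates.
use std::{collections::HashMap, fs::File, sync::Arc};

use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::{ArrowWriter, PARQUET_FIELD_ID_META_KEY};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // The "PARQUET:field_id" metadata entry on an Arrow field is what the
    // Arrow-to-Parquet writer copies into the Parquet schema as the field ID.
    let id_field = Field::new("id", DataType::Int64, false).with_metadata(
        HashMap::from([(PARQUET_FIELD_ID_META_KEY.to_string(), "1".to_string())]),
    );
    let name_field = Field::new("name", DataType::Utf8, true).with_metadata(
        HashMap::from([(PARQUET_FIELD_ID_META_KEY.to_string(), "2".to_string())]),
    );
    let schema = Arc::new(Schema::new(vec![id_field, name_field]));

    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int64Array::from(vec![1_i64, 2, 3])) as ArrayRef,
            Arc::new(StringArray::from(vec![Some("a"), Some("b"), None])),
        ],
    )?;

    let file = File::create("00000-0-data.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    // close() flushes the footer and returns the Parquet FileMetaData, which
    // is also where the statistics for the manifest metrics come from.
    let _file_metadata = writer.close()?;
    Ok(())
}
```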
   
   > calling the writer to write the DataFile
   
   > I think this is also what the python implementation does. In 
Transaction.append, it calls _dataframe_to_data_files to generate DataFiles 
based on the pa.Table.
   
   In PyIceberg we have [`_dataframe_to_data_files`](https://github.com/apache/iceberg-python/blob/49ac3a27794fc12cfb67b29502ba92b429396201/pyiceberg/table/__init__.py#L2683), which writes out the Arrow table to one or more Parquet files. Then we collect all the statistics and return a `DataFile` that can be appended to the table. I hope in the future that we can push this down to iceberg-rust :)
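
   To make the statistics-collection part concrete, here is a hedged Rust sketch that continues the Parquet example above: it walks the `FileMetaData` that `ArrowWriter::close()` returns and tallies record counts and per-column null counts (lower/upper bounds would come from the same statistics, but need per-type decoding). The `DataFileMetrics` struct is made up for illustration and is not an iceberg-rust type.

```rust
// Hedged sketch, continuing the example above: derive rough per-file metrics
// from the parquet FileMetaData returned by ArrowWriter::close(). The
// DataFileMetrics struct is made up for illustration, not an iceberg-rust type.
use std::collections::HashMap;

use parquet::format::FileMetaData;

#[derive(Debug, Default)]
struct DataFileMetrics {
    record_count: i64,
    // Null-value counts keyed by column path, e.g. "name".
    null_value_counts: HashMap<String, i64>,
}

// Usage, continuing the sketch above: let metrics = collect_metrics(&writer.close()?);
fn collect_metrics(meta: &FileMetaData) -> DataFileMetrics {
    let mut metrics = DataFileMetrics::default();
    for row_group in &meta.row_groups {
        metrics.record_count += row_group.num_rows;
        for column in &row_group.columns {
            let Some(col_meta) = &column.meta_data else { continue };
            let path = col_meta.path_in_schema.join(".");
            if let Some(stats) = &col_meta.statistics {
                if let Some(nulls) = stats.null_count {
                    *metrics.null_value_counts.entry(path).or_insert(0) += nulls;
                }
            }
        }
    }
    metrics
}
```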
   
   > If any error happens during generating metadata relation info like 
manifest etc., as the writer already wrote DataFiles, should we go to delete 
the written DataFiles?
   
   Iceberg Java does this on a best-effort basis. If the commit fails, it tries to clean up the files it has written, but it is always possible that the cleanup won't happen (looking at you, OOMs). This is where the maintenance tasks kick in, as @sdd already pointed out.
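
   As a rough, hedged sketch of that best-effort pattern (the function and its arguments are made up for illustration, not iceberg-rust API):

```rust
use std::fs;
use std::path::Path;

// Hedged sketch of the best-effort cleanup idea. `commit` stands in for
// writing the manifest / manifest-list and committing the new metadata.
// If it fails, try to remove the data files that were already written,
// ignoring errors during cleanup; a crash or OOM can still leave orphan
// files behind, which is what the maintenance tasks clean up later.
fn commit_or_clean_up<E>(
    written_files: &[&Path],
    commit: impl FnOnce() -> Result<(), E>,
) -> Result<(), E> {
    match commit() {
        Ok(()) => Ok(()),
        Err(err) => {
            for path in written_files {
                let _ = fs::remove_file(path); // best effort, errors ignored
            }
            Err(err)
        }
    }
}
```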
   
   Talking about prioritization: things can happen in parallel. For example, something simpler like updating table properties will make sure that the commit path is in place. The snapshot-summary generation can be its own PR, and the same goes for collecting the column metrics.
   

