subject:"\[I\] Discussion\: Next steps \/ requirements to support `append` files \[iceberg\-rust\]"

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

2024-04-25 Thread via GitHub



viirya commented on issue #329:
URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2077637850

   Thanks @liurenjie1024. The roadmaps doc looks good to me. I added a few 
items under DataFusion integration. Feel free to modify it. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

2024-04-23 Thread via GitHub



liurenjie1024 commented on issue #329:
URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2072455753

   I've compiled a 
[doc](https://docs.google.com/document/d/1YncDX-qQ1T9jBGQmJNtRcPU1trRi00cB8eykv5diKw4/edit?usp=sharing)
 for discussing roadmaps and features for iceberg-rust, welcome to share you 
thoughts and feel free to add what's in your mind. cc @viirya @marvinlanhenke 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

2024-04-16 Thread via GitHub



liurenjie1024 commented on issue #329:
URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2059202270

   I think to implement appending data file, there are two main tasks:
   
   1. Implement transaction api to append data file
   2. Implement file writer to write record batches to parquet files, and 
generate data file structs. 
   
   Currently there is no design or plan for 1, and @ZENOTME is working on 2.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

2024-04-16 Thread via GitHub

Fokko commented on issue #329:
URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2058636547

@marvinlanhenke Sorry for being late to the party here. Appending a file is
rather straightforward, but all the conditions must be met. This is the
high-level way of appending a file:

- Write a Parquet file with the field IDs populated.
- Collect the metrics to populate the statistics in the manifest file. We do
this in PyIceberg
[here](https://github.com/apache/iceberg-python/blob/49ac3a27794fc12cfb67b29502ba92b429396201/pyiceberg/io/pyarrow.py#L1433-L1496).
- Write the snapshot following the concept of a fast-append. A normal append
will append the new files to an existing manifest, and a fast-append will write
a new manifest file with the new entries. This is much easier to implement,
since you don't have to worry about [sequence-number inheritance and
such](https://iceberg.apache.org/spec/#sequence-number-inheritance).
- Rewrite the manifest-list to add the newly created manifest.
- Generate a snapshot summary
- Update the metadata. When you are using a traditional catalog like Glue
and Hive, this can be a bit of work. If you use the Iceberg REST catalog, this
is much easier since it is the responsibility of the REST catalog.

> calling the writer to write the DataFile

> I think this is also what the python implementation does. In
Transaction.append, it calls _dataframe_to_data_files to generate DataFiles
based on the pa.Table.

In [PyIceberg we have
`_dataframe_to_data_files`](https://github.com/apache/iceberg-python/blob/49ac3a27794fc12cfb67b29502ba92b429396201/pyiceberg/table/__init__.py#L2683)
that writes out the Arrow table to one or more Parquet files. Then we collect
all the statistics and return a Datafile that can be appended to the table. I
hope in the future that we can push this down to iceberg-rust :)

> If any error happens during generating metadata relation info like
manifest etc., as the writer already wrote DataFiles, should we go to delete
the written DataFiles?

Iceberg Java does this best effort. If it fails, it tries to clean it up,
but it is always possible that this won't happen (Looking at you OOMs). This is
where the maintenance tasks kick in, as @sdd already pointed out.

Talking about prioritization: Things can happen in parallel. For example,
something simpler like updating table properties will make sure that the commit
path is in place. The Snapshot summary generation can be a PR. The same goes
for collecting the column metrics.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

2024-04-09 Thread via GitHub



sdd commented on issue #329:
URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2044235422

   > This should probably accept a RecordBatch as a param, create a new 
Transaction, and delegates further action to the transaction.
   
   Is there a reason why append wouldn't take a `RecordBatchStream`? It would 
permit us to make appends that are larger than would fit into memory, if the 
underlying IO method (eg multipart upload) supported it. I for one would find 
this useful.
   
   > If any error happens during generating metadata relation info like 
manifest etc., as the writer already wrote DataFiles, should we go to delete 
the written DataFiles?
   
   I think that this becomes the responsibility of the 
https://iceberg.apache.org/docs/latest/maintenance/#delete-orphan-files 
maintenance task, rather than the writer. If we decide that the writer could 
attempt to do this, it should be optional. This would slow down writes in the 
case where there is a lot of write contention.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

2024-04-07 Thread via GitHub



viirya commented on issue #329:
URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2041597213

   > calling the writer to write the DataFile
   create an instance of MergingSnapshotProducer -> responsible for writing the 
manifest, manifest_list, snapshot_update
   commit -> update_table() on the Catalog with TableUpdate & TableRequirements
   
   If any error happens during generating metadata relation info like manifest 
etc., as the writer already wrote DataFiles, should we go to delete the written 
DataFiles?
   
   > I think your understanding is correct - and I agree if the writer API 
already does the conversion from RecordBatch to DataFile, the Transaction 
shouldn't be concerned with this issue, since it is a higher-level API. 
However, the Transaction calls the writer that writes the actual DataFile, 
which seems reasonable.
   
   I think this is also what the python implementation does. In 
`Transaction.append`, it calls `_dataframe_to_data_files` to generate DataFiles 
based on the `pa.Table`.
   
   > we create a Transaction that basically does two things:
   2.1. It creates a _MergingSnapshotProducer which is (on a high-level) 
responsible for writing a new ManifestList, creating a new Snapshot (returned 
as AddSnaphotUpdate)
   
   Yea, specifically, it is a `FastAppendFiles` for appending files. Although 
the manifest commit logic is actually implemented in `_MergingSnapshotProducer`.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

2024-04-07 Thread via GitHub



marvinlanhenke commented on issue #329:
URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2041548391

   > I'm not sure whether my understanding is correct: The target of 
`table.append()` is used to insert a batch of data into the table. It's seems 
like a high level API which will use two lower API:
   > 
   > 1. [writer API](https://github.com/apache/iceberg-rust/issues/34) for 
convert RecordBatch to DataFile
   > 2. [transaction 
API](https://github.com/apache/iceberg-rust/blob/ca9de89ac9d95683c8fe9191f72ab922dc4c7672/crates/iceberg/src/transaction.rs#L30)
  for commit the DataFile(update the table metadata)
   > 
   > To separate these two interfaces, I think we don't need to delegate the 
conversion between `RecordBatch` and `DataFile` in the transaction.
   
   I think your understanding is correct - and I agree if the writer API 
already does the conversion from RecordBatch to DataFile, the Transaction 
shouldn't be concerned with this issue, since it is a higher-level API. 
However, the Transaction calls the writer that writes the actual DataFile, 
which seems reasonable. 
   
   So the Transaction `append` (if I understand the py impl correctly) does all 
of those things:
   - calling the writer to write the DataFile
   - create an instance of MergingSnapshotProducer -> responsible for writing 
the manifest, manifest_list, snapshot_update
   - commit -> update_table() on the Catalog with TableUpdate & 
TableRequirements
   
   @ZENOTME 
   Where would the writer API (which I only know from the design spec in #34) 
fit best here? Should a Transaction create a new writer everytime a new 
transaction is created? Or should the Table itself hold a ref to a writer?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

2024-04-07 Thread via GitHub



ZENOTME commented on issue #329:
URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2041528557

   I'm not sure whether my understanding is correct: 
   The target of `table.append()` is used to insert a batch of data into the 
table. It's seems like a high level API which will use two lower API:
   1. [writer API](https://github.com/apache/iceberg-rust/issues/34) for 
convert RecordBatch to DataFile
   2. [transaction 
API](https://github.com/apache/iceberg-rust/blob/ca9de89ac9d95683c8fe9191f72ab922dc4c7672/crates/iceberg/src/transaction.rs#L30)
  for commit the DataFile(update the table metadata)
   
   To separate these two interfaces, I think we don't need to delegate the 
conversion between `RecordBatch` and `DataFile` in the transaction. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

2024-04-07 Thread via GitHub



sdd commented on issue #329:
URL: https://github.com/apache/iceberg-rust/issues/329#issuecomment-2041520090

   Thanks for spending the time thinking about this and putting your thoughts 
into words. I need to spend some time re-reading the associated parts of the 
spec and looking through the Java and possibly python implementations before 
being able to comment. I should get chance tomorrow. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

Re: [I] Discussion: Next steps / requirements to support `append` files [iceberg-rust]

9 matches

Site Navigation

Mail list logo

Footer information