marvinlanhenke opened a new issue, #329:
URL: https://github.com/apache/iceberg-rust/issues/329

   Out of curiosity, I took a closer look at the pyiceberg implementation and 
how `Table.append()` works.
    
   Now, I would like to pick your brain to understand and track the next 
steps we have to take to support `append` as well (since we should be 
getting close to having write support). The goal here is to extract and create 
actionable issues.
   
   Here is what I understand from the python impl so far (high-level):
   ---
   1. we call `append()` on the Table class with our DataFrame: pa.Table and 
the snapshot_properties: Dict[str, str]
   2. we create a `Transaction` that basically does two things:
   2.1. It creates a `_MergingSnapshotProducer`, which is (at a high level) 
responsible for writing a new ManifestList and creating a new Snapshot (returned 
as AddSnapshotUpdate)
   2.2. It calls `update_table` on the respective Catalog, which creates a new 
metadata.json and returns the new metadata as well as the new metadata_location
   
   
[pyiceberg-link](https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L1314)
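
   To make the flow above concrete, here is a minimal Rust sketch of steps 2.1 and 2.2. Everything in it is hypothetical: `TableUpdate`, `TableMetadata`, `Catalog`, `InMemoryCatalog` and `Transaction` are stand-ins invented for illustration, not the iceberg-rust or pyiceberg API, and the snapshot ids and file names are simplified placeholders. No files are written.

```rust
use std::collections::HashMap;

/// Hypothetical table update; only the "add snapshot" case is modeled.
#[derive(Debug, Clone, PartialEq)]
enum TableUpdate {
    AddSnapshot { snapshot_id: i64, manifest_list: String },
}

/// Hypothetical, heavily reduced table metadata.
#[derive(Debug, Clone, PartialEq)]
struct TableMetadata {
    location: String,
    version: u32,
    /// (snapshot id, manifest list path) pairs.
    snapshots: Vec<(i64, String)>,
}

/// Hypothetical catalog interface: applies the updates and returns the new
/// metadata together with the new metadata location (step 2.2).
trait Catalog {
    fn update_table(
        &mut self,
        current: &TableMetadata,
        updates: Vec<TableUpdate>,
    ) -> (TableMetadata, String);
}

/// In-memory stand-in for a concrete catalog; keys are metadata.json paths.
#[derive(Default)]
struct InMemoryCatalog {
    metadata_files: HashMap<String, TableMetadata>,
}

impl Catalog for InMemoryCatalog {
    fn update_table(
        &mut self,
        current: &TableMetadata,
        updates: Vec<TableUpdate>,
    ) -> (TableMetadata, String) {
        let mut next = current.clone();
        next.version += 1;
        for update in updates {
            match update {
                TableUpdate::AddSnapshot { snapshot_id, manifest_list } => {
                    next.snapshots.push((snapshot_id, manifest_list));
                }
            }
        }
        // Assumed naming scheme for the new metadata.json.
        let location = format!("{}/metadata/v{}.metadata.json", next.location, next.version);
        self.metadata_files.insert(location.clone(), next.clone());
        (next, location)
    }
}

/// Collects updates and hands them to the catalog on commit.
struct Transaction<'a, C: Catalog> {
    catalog: &'a mut C,
    base: TableMetadata,
    updates: Vec<TableUpdate>,
}

impl<'a, C: Catalog> Transaction<'a, C> {
    fn new(catalog: &'a mut C, base: TableMetadata) -> Self {
        Transaction { catalog, base, updates: Vec::new() }
    }

    /// Step 2.1, reduced to its result: the snapshot producer would write
    /// manifests and a manifest list, then yield an AddSnapshot update.
    fn append(mut self) -> Self {
        // Simplification: sequential ids instead of generated snapshot ids.
        let snapshot_id = (self.base.snapshots.len() + self.updates.len()) as i64 + 1;
        let manifest_list = format!("{}/metadata/snap-{}.avro", self.base.location, snapshot_id);
        self.updates.push(TableUpdate::AddSnapshot { snapshot_id, manifest_list });
        self
    }

    /// Step 2.2: delegate to the catalog.
    fn commit(self) -> (TableMetadata, String) {
        let Transaction { catalog, base, updates } = self;
        catalog.update_table(&base, updates)
    }
}
```

   With these stand-ins, `Transaction::new(&mut catalog, base).append().commit()` returns the new metadata and its location, which mirrors what `update_table` is described as returning above.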
   
   Here is what I think we need to implement (rough sketch):
   ---
   1. [impl](https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L1314) `fn append(...)` on `struct Table`:
   This should probably accept a RecordBatch as a param, create a new 
`Transaction`, and delegate further action to the transaction.
   2. [impl](https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L362) `fn append(...)` on `struct Transaction`:
   Receives the RecordBatch and snapshot_properties. Performs validation checks. 
Converts the RecordBatch to a collection of `DataFiles` and creates a 
`_MergingSnapshotProducer` with the collection.
   3. [impl](https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L2745) `_MergingSnapshotProducer`:
   :: write manifests (added, deleted, existing)
   :: get next_sequence_number from `TableMetadata`
   :: update snapshot summaries 
   :: generate manifest_list_path
   :: write manifest_list
   :: create a new Snapshot
   :: return TableUpdate: AddSnapshot
   4. impl `update_table` on the concrete Catalog implementations
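
   The `_MergingSnapshotProducer` steps from item 3 could be sketched roughly as follows. This is only an illustration under stated assumptions: all types and method names (`DataFile`, `ManifestFile`, `Snapshot`, `MergingSnapshotProducer`, `write_manifests`, ...) are hypothetical, only the "added" manifest is modeled, the next sequence number is assumed to be `last_sequence_number + 1`, the file names are simplified, and nothing is actually written to storage. The summary keys `operation`, `added-data-files` and `added-records` follow the snapshot summary fields of the Iceberg table spec.

```rust
use std::collections::BTreeMap;

/// Hypothetical, reduced data file description.
#[derive(Debug, Clone, PartialEq)]
struct DataFile {
    path: String,
    record_count: u64,
}

/// Hypothetical manifest entry summary; stands in for a written manifest.
#[derive(Debug, Clone, PartialEq)]
struct ManifestFile {
    path: String,
    added_files: usize,
    added_rows: u64,
}

#[derive(Debug, Clone, PartialEq)]
struct Snapshot {
    snapshot_id: i64,
    parent_snapshot_id: Option<i64>,
    sequence_number: i64,
    manifest_list: String,
    /// In-memory stand-in for the contents of the manifest list file.
    manifests: Vec<ManifestFile>,
    summary: BTreeMap<String, String>,
}

#[derive(Debug, Clone, PartialEq)]
enum TableUpdate {
    AddSnapshot(Snapshot),
}

struct TableMetadata {
    location: String,
    last_sequence_number: i64,
    current_snapshot_id: Option<i64>,
}

impl TableMetadata {
    /// Assumption: the next sequence number is simply the last one plus one.
    fn next_sequence_number(&self) -> i64 {
        self.last_sequence_number + 1
    }
}

struct MergingSnapshotProducer {
    snapshot_id: i64,
    added: Vec<DataFile>,
    snapshot_properties: BTreeMap<String, String>,
}

impl MergingSnapshotProducer {
    /// "write manifests": only the manifest for added files is modeled.
    fn write_manifests(&self, metadata: &TableMetadata) -> Vec<ManifestFile> {
        if self.added.is_empty() {
            return Vec::new();
        }
        vec![ManifestFile {
            path: format!("{}/metadata/{}-m0.avro", metadata.location, self.snapshot_id),
            added_files: self.added.len(),
            added_rows: self.added.iter().map(|f| f.record_count).sum(),
        }]
    }

    /// "update snapshot summaries": user properties plus computed counters.
    fn summary(&self, manifests: &[ManifestFile]) -> BTreeMap<String, String> {
        let mut summary = self.snapshot_properties.clone();
        let files: usize = manifests.iter().map(|m| m.added_files).sum();
        let rows: u64 = manifests.iter().map(|m| m.added_rows).sum();
        summary.insert("operation".to_string(), "append".to_string());
        summary.insert("added-data-files".to_string(), files.to_string());
        summary.insert("added-records".to_string(), rows.to_string());
        summary
    }

    /// Runs the steps in the order listed above and returns the update.
    fn commit(&self, metadata: &TableMetadata) -> TableUpdate {
        let manifests = self.write_manifests(metadata);
        let sequence_number = metadata.next_sequence_number();
        let summary = self.summary(&manifests);
        // Simplified manifest list path; real names carry more components.
        let manifest_list =
            format!("{}/metadata/snap-{}.avro", metadata.location, self.snapshot_id);
        TableUpdate::AddSnapshot(Snapshot {
            snapshot_id: self.snapshot_id,
            parent_snapshot_id: metadata.current_snapshot_id,
            sequence_number,
            manifest_list,
            manifests,
            summary,
        })
    }
}
```

   Keeping each step as its own method is one way the producer could be split into several smaller issues, with `commit` as the last piece that ties them together.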
   
   What could be possible issues here?
   I think we need to start with the `_MergingSnapshotProducer` (possibly split 
into multiple parts) and work our way up the list?
   Once we have the MergingSnapshotProducer, we can implement the append 
function on Transaction, which basically orchestrates these steps?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

