rdblue commented on code in PR #41:
URL: https://github.com/apache/iceberg-python/pull/41#discussion_r1429240582


##########
pyiceberg/table/__init__.py:
##########
@@ -830,6 +884,49 @@ def history(self) -> List[SnapshotLogEntry]:
     def update_schema(self, allow_incompatible_changes: bool = False, 
case_sensitive: bool = True) -> UpdateSchema:
         return UpdateSchema(self, 
allow_incompatible_changes=allow_incompatible_changes, 
case_sensitive=case_sensitive)
 
+    def write_arrow(self, df: pa.Table, mode: Literal['append', 'overwrite'] = 
'overwrite') -> None:

Review Comment:
   I'm wondering what API we want to expose here. Do we want a `write_arrow` 
method or should we just accept a `df` and test whether it is an acceptable 
type?
   
   Spark used a similar API (`write` with a `mode` argument) that we ended up 
replacing with one that is more specific to avoid confusion. In the new API, we 
used more specific verbs like `append`, `overwrite_by_partition`, 
`overwrite_by_filter`, `replace`, etc. One of the problems was that it wasn't 
clear what the behavior of `overwrite` would be: it could replace the schema 
and the data or it could replace only the data. It's a little more clear here 
because this is passing data to the table instance; I think it would be 
reasonable to assume that the table schema is enforced. However, I don't think 
a `mode` is quite enough to cover what to overwrite: whether it is the entire 
table or just parts of it.
   
   What about a slightly different API that uses method names for the verb? 
Something like this:
   
   ```python
   def append(self, df: pa.Table) -> None:
       """Appends the contents of a dataframe to the table"""
       pass
   
   def overwrite(self, df: pa.Table, overwrite_filter: Expression=AlwaysTrue()) 
-> None:
       """Overwrites the data matching overwrite_filter with the contents of a 
dataframe"""
       if overwrite_filter != AlwaysTrue():
           raise NotImplementedError("Cannot overwrite a subset of a table")
       if len(self.spec().fields) > 0:
           raise NotImplementedError("Cannot write to a partitioned table")
       ...
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to