Re: [PR] [WIP] Add Data Files from Parquet Files [iceberg-python]

via GitHub Thu, 14 Mar 2024 01:36:18 -0700


Fokko commented on code in PR #506:
URL: https://github.com/apache/iceberg-python/pull/506#discussion_r1524358528



##########
pyiceberg/table/__init__.py:
##########
@@ -1147,6 +1150,26 @@ def overwrite(self, df: pa.Table, overwrite_filter: 
BooleanExpression = ALWAYS_T
                     for data_file in data_files:
                         update_snapshot.append_data_file(data_file)
 
+    def add_files(self, file_paths: List[str]) -> None:
+        """
+        Shorthand API for adding files as data files to the table.
+
+        Args:
+            file_paths: The list of full file paths to be added as data files 
to the table
+        """
+        if any(not isinstance(field.transform, IdentityTransform) for field in 
self.metadata.spec().fields):
+            raise NotImplementedError("Cannot add_files to a table with 
Transform Partitions")

Review Comment:
   We can be more permissive. It isn't a problem if the table's current 
partitioning uses something other than an `IdentityTransform`; the issue is 
that we cannot add DataFiles that use this partitioning (until we find a clever 
way of checking this).



##########
pyiceberg/table/__init__.py:
##########
@@ -1147,6 +1150,26 @@ def overwrite(self, df: pa.Table, overwrite_filter: 
BooleanExpression = ALWAYS_T
                     for data_file in data_files:
                         update_snapshot.append_data_file(data_file)
 
+    def add_files(self, file_paths: List[str]) -> None:
+        """
+        Shorthand API for adding files as data files to the table.
+
+        Args:
+            file_paths: The list of full file paths to be added as data files 
to the table
+        """

Review Comment:
   It would be great to add a `Raises:` section here indicating which errors to 
expect, for example when a file cannot be found. In such a case, we want to 
raise a PyIceberg exception instead of an Arrow-specific exception.
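
A minimal sketch of what such a docstring and error translation could look like. `DataFileNotFoundError` and `check_files_readable` are hypothetical names used only for illustration (they are not existing PyIceberg APIs), and the built-in `open()` stands in for the Arrow reader:

```python
from typing import List


class DataFileNotFoundError(Exception):
    """Hypothetical library-level error; not an actual PyIceberg class."""


def check_files_readable(file_paths: List[str]) -> None:
    """
    Verify that every path can be opened before it is added to the table.

    Args:
        file_paths: The list of full file paths to be added as data files to the table

    Raises:
        DataFileNotFoundError: If any of the given paths does not exist.
    """
    for path in file_paths:
        try:
            # Assumption: built-in open() stands in for the Arrow reader here.
            with open(path, "rb"):
                pass
        except FileNotFoundError as exc:
            # Re-raise as the library's own error, keeping the original as the cause.
            raise DataFileNotFoundError(f"Cannot add file, path not found: {path}") from exc
```

Chaining with `from exc` keeps the underlying error available for debugging, while callers only need to catch the library-level exception.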



##########
pyiceberg/table/__init__.py:
##########
@@ -1147,6 +1150,26 @@ def overwrite(self, df: pa.Table, overwrite_filter: 
BooleanExpression = ALWAYS_T
                     for data_file in data_files:
                         update_snapshot.append_data_file(data_file)
 
+    def add_files(self, file_paths: List[str]) -> None:
+        """
+        Shorthand API for adding files as data files to the table.
+
+        Args:
+            file_paths: The list of full file paths to be added as data files 
to the table
+        """
+        if any(not isinstance(field.transform, IdentityTransform) for field in 
self.metadata.spec().fields):
+            raise NotImplementedError("Cannot add_files to a table with 
Transform Partitions")
+
+        if self.name_mapping() is None:

Review Comment:
   Technically, you don't have to add a name-mapping if the field-IDs are set.
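
A rough sketch of how that check might look, assuming the field IDs are read through PyArrow, which exposes a Parquet column's field ID in the field metadata under the key `b"PARQUET:field_id"`. The helper name `all_fields_have_ids` is hypothetical, and the sketch only inspects top-level fields:

```python
from typing import Iterable, Mapping, Optional

# Key under which PyArrow exposes a Parquet column's field ID in the field metadata.
PARQUET_FIELD_ID_KEY = b"PARQUET:field_id"


def all_fields_have_ids(field_metadata: Iterable[Optional[Mapping[bytes, bytes]]]) -> bool:
    """Return True when every given field carries a field ID (hypothetical helper)."""
    return all(
        meta is not None and PARQUET_FIELD_ID_KEY in meta
        for meta in field_metadata
    )
```

With a PyArrow schema this could be called as `all_fields_have_ids(f.metadata for f in schema)`; a name-mapping would then only be needed when it returns `False`. Nested fields would need a recursive walk, which this sketch leaves out.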



##########
pyiceberg/table/__init__.py:
##########
@@ -1147,6 +1150,26 @@ def overwrite(self, df: pa.Table, overwrite_filter: 
BooleanExpression = ALWAYS_T
                     for data_file in data_files:
                         update_snapshot.append_data_file(data_file)
 
+    def add_files(self, file_paths: List[str]) -> None:
+        """
+        Shorthand API for adding files as data files to the table.
+
+        Args:
+            file_paths: The list of full file paths to be added as data files 
to the table
+        """
+        if any(not isinstance(field.transform, IdentityTransform) for field in 
self.metadata.spec().fields):
+            raise NotImplementedError("Cannot add_files to a table with 
Transform Partitions")
+
+        if self.name_mapping() is None:
+            with self.transaction() as tx:
+                tx.set_properties(**{TableProperties.DEFAULT_NAME_MAPPING: 
self.schema().name_mapping.model_dump_json()})
+
+        with self.transaction() as txn:
+            with txn.update_snapshot().fast_append() as update_snapshot:

Review Comment:
   Now with https://github.com/apache/iceberg-python/pull/471 merged, this 
should work in a single transaction. The updated metadata will be passed into 
the UpdateSnapshot class and should pick up the name-mapping.
   
   ```suggestion
               with tx.update_snapshot().fast_append() as update_snapshot:
   ```
   
   I think it is important to have this operation in a single transaction; 
otherwise the name-mapping might be set and, if a file is then missing, the 
operation will fail while the name-mapping is still there.
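
To illustrate the reasoning, here is a toy model of the single-transaction behaviour. `ToyTable` and `ToyTransaction` are hypothetical stand-ins, not PyIceberg classes; they only show that staged changes are dropped together when any step fails:

```python
from typing import Dict, List


class ToyTable:
    """Toy stand-in for a table: committed properties and data files only."""

    def __init__(self) -> None:
        self.properties: Dict[str, str] = {}
        self.data_files: List[str] = []


class ToyTransaction:
    """Stages changes and applies them only if the with-block exits cleanly."""

    def __init__(self, table: ToyTable) -> None:
        self._table = table
        self._staged_properties: Dict[str, str] = {}
        self._staged_files: List[str] = []

    def set_properties(self, **properties: str) -> None:
        self._staged_properties.update(properties)

    def append_data_file(self, path: str, exists: bool) -> None:
        # `exists` simulates whether the file can be found.
        if not exists:
            raise FileNotFoundError(path)
        self._staged_files.append(path)

    def __enter__(self) -> "ToyTransaction":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc_type is None:
            # Commit everything at once.
            self._table.properties.update(self._staged_properties)
            self._table.data_files.extend(self._staged_files)
        return False  # never swallow the exception


table = ToyTable()
try:
    with ToyTransaction(table) as tx:
        tx.set_properties(name_mapping="example")
        tx.append_data_file("missing.parquet", exists=False)
except FileNotFoundError:
    pass
```

After the failed append, `table.properties` and `table.data_files` are both still empty. With two separate transactions, the property change would already have been committed before the append failed.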



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

