sungwy commented on code in PR #1036:
URL: https://github.com/apache/iceberg-python/pull/1036#discussion_r1714391548
##########
pyiceberg/table/__init__.py:
##########
@@ -630,7 +630,20 @@ def add_files(self, file_paths: List[str],
snapshot_properties: Dict[str, str] =
Raises:
FileNotFoundError: If the file does not exist.
+ ValueError: Raises a ValueError given file_paths contains
duplicate files
+ ValueError: Raises a ValueError given file_paths already
referenced by table
"""
+ if len(file_paths) != len(set(file_paths)):
+ raise ValueError("File paths must be unique")
+
+ import pyarrow.compute as pc
+
+ expr = pc.field("file_path").isin(file_paths)
+ referenced_files = [file["file_path"] for file in
self._table.inspect.files().filter(expr).to_pylist()]
+
+ if referenced_files:
+ raise ValueError(f"Cannot add files that are already referenced by
table, files: {', '.join(referenced_files)}")
Review Comment:
Just one more suggestion here @amitgilad3 - should we make this behavior the
default, but have a boolean flag available to disable it in the `add_files` API?
A similar [spark
procedure](https://iceberg.apache.org/docs/1.5.1/spark-procedures/#add_files)
has an equivalent flag `check_duplicate_files` and I feel that it could be
useful because inspecting the files table requires us to read all of the active
manifest files. If the user already knows that they won't be running into this
issue, I think it would be useful for them to be able to disable this check.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]