Re: [PR] Manual deduction of partitions [iceberg-python]

via GitHub Tue, 04 Mar 2025 02:36:40 -0800


Fokko commented on code in PR #1743:
URL: https://github.com/apache/iceberg-python/pull/1743#discussion_r1979125483



##########
pyiceberg/io/pyarrow.py:
##########
@@ -2475,18 +2484,25 @@ def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_
                 f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids"
             )
         schema = table_metadata.schema()
-        _check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema())
+        if check_schema:
+            _check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema())
 
         statistics = data_file_statistics_from_parquet_metadata(
             parquet_metadata=parquet_metadata,
             stats_columns=compute_statistics_plan(schema, table_metadata.properties),
             parquet_column_mapping=parquet_path_to_id_mapping(schema),
+            check_schema=check_schema,
         )
+        if partition_deductor is None:
+            partition = statistics.partition(table_metadata.spec(), table_metadata.schema())
+        else:
+            partition = partition_deductor(file_path)

Review Comment:
   While you can add keys to the `Record`, its values are looked up by position,
based on the schema that belongs to it (in this case, the one of the active
PartitionSpec).
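
To illustrate the point about positional lookup, here is a minimal sketch. It does not use PyIceberg's actual `Record` class; `PositionalRecord`, `PartitionField`, and `partition_value` are hypothetical stand-ins, and the only assumption carried over from the comment is that values are read by index in the order of the partition spec's fields.

```python
from dataclasses import dataclass
from typing import Any, Sequence


@dataclass(frozen=True)
class PartitionField:
    # Hypothetical stand-in for one field of the partition spec's struct type.
    name: str


class PositionalRecord:
    """Toy record (not PyIceberg's Record): values are stored and read by index."""

    def __init__(self, *values: Any) -> None:
        self._values = list(values)

    def __getitem__(self, pos: int) -> Any:
        return self._values[pos]


def partition_value(record: PositionalRecord, fields: Sequence[PartitionField], name: str) -> Any:
    # The name is resolved to a position through the spec's field order;
    # the record itself never sees the name.
    for pos, field in enumerate(fields):
        if field.name == name:
            return record[pos]
    raise KeyError(name)


# Field order of a hypothetical active partition spec.
spec_fields = [PartitionField("event_date"), PartitionField("region")]

# Values in spec order are read back correctly.
in_order = PositionalRecord("2025-03-04", "eu")

# Values in a different order are silently misread, because only position matters.
swapped = PositionalRecord("eu", "2025-03-04")
```

In this sketch, a `partition_deductor` that returns values in an order other than the spec's field order would produce the `swapped` case: no error, but the wrong value under each partition field.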



##########
pyiceberg/io/pyarrow.py:
##########
@@ -2475,18 +2484,25 @@ def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_
                 f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids"
             )
         schema = table_metadata.schema()
-        _check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema())
+        if check_schema:
+            _check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema())

Review Comment:
   At Iceberg, we care a lot about making sure that everything is
compatible at write time. Instead of skipping the check, we could also change
`_check_pyarrow_schema_compatible` to allow for additional columns in the
Parquet file.
   
   It is okay to skip `optional` columns but not `required` ones.
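
The rule suggested above can be sketched as follows. This is not the implementation of `_check_pyarrow_schema_compatible`; `Column` and `check_file_columns` are hypothetical names, and the sketch compares column names only, leaving out type checks, nested fields, and name mapping.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Column:
    # Hypothetical stand-in for a table schema field.
    name: str
    required: bool


def check_file_columns(table_columns: Iterable[Column], file_column_names: Iterable[str]) -> List[str]:
    """Validate a file's columns against the table schema by name.

    Raises ValueError if a required table column is absent from the file.
    Missing optional columns are accepted. Returns the sorted names of
    file columns that the table does not know about, which are tolerated.
    """
    columns = list(table_columns)
    present = set(file_column_names)

    missing_required = [c.name for c in columns if c.required and c.name not in present]
    if missing_required:
        raise ValueError(f"File is missing required columns: {missing_required}")

    known = {c.name for c in columns}
    return sorted(present - known)


# Hypothetical table schema: one required and one optional column.
table = [Column("id", required=True), Column("note", required=False)]
```

With this shape the check always runs at write time, and a file with extra columns passes without needing a `check_schema=False` escape hatch.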



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

