Fokko commented on code in PR #1743:
URL: https://github.com/apache/iceberg-python/pull/1743#discussion_r1979125483
##########
pyiceberg/io/pyarrow.py:
##########
@@ -2475,18 +2484,25 @@ def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_
                 f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids"
             )
         schema = table_metadata.schema()
-        _check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema())
+        if check_schema:
+            _check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema())
         statistics = data_file_statistics_from_parquet_metadata(
             parquet_metadata=parquet_metadata,
             stats_columns=compute_statistics_plan(schema, table_metadata.properties),
             parquet_column_mapping=parquet_path_to_id_mapping(schema),
+            check_schema=check_schema,
         )
+        if partition_deductor is None:
+            partition = statistics.partition(table_metadata.spec(), table_metadata.schema())
+        else:
+            partition = partition_deductor(file_path)
Review Comment:
While you can add keys to the `Record`, its values are looked up by position,
based on the schema that belongs to it (in this case, the schema of the active
PartitionSpec).
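
To illustrate the point about positional lookup, here is a minimal sketch. It
does not use PyIceberg's `Record`; `PartitionField` and
`partition_values_in_spec_order` are hypothetical names made up for this
example. The assumption is that whatever a `partition_deductor` returns would
need its values arranged in the order of the active spec's fields, since
key names alone would not determine where a value lands.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class PartitionField:
    """Hypothetical stand-in for one field of a partition spec."""

    name: str


def partition_values_in_spec_order(
    spec_fields: List[PartitionField], values_by_name: Dict[str, Any]
) -> Tuple[Any, ...]:
    """Arrange name-keyed partition values by the position of each spec field."""
    missing = [f.name for f in spec_fields if f.name not in values_by_name]
    if missing:
        raise ValueError(f"Missing partition values for: {missing}")
    # Position i of the result corresponds to spec_fields[i], regardless of
    # the order in which the dictionary was built.
    return tuple(values_by_name[f.name] for f in spec_fields)


spec = [PartitionField("region"), PartitionField("day")]
values = partition_values_in_spec_order(spec, {"day": 19000, "region": "eu"})
```

Here `values` is `("eu", 19000)`: the order follows the spec, not the order of
the keys in the dictionary.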
##########
pyiceberg/io/pyarrow.py:
##########
@@ -2475,18 +2484,25 @@ def parquet_files_to_data_files(io: FileIO, table_metadata: TableMetadata, file_
                 f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids"
             )
         schema = table_metadata.schema()
-        _check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema())
+        if check_schema:
+            _check_pyarrow_schema_compatible(schema, parquet_metadata.schema.to_arrow_schema())
Review Comment:
In Iceberg, we're pretty concerned with making sure that everything is
compatible at write time. Instead of skipping the check, we could change
`_check_pyarrow_schema_compatible` to allow for additional columns in the
Parquet file.
It is okay to skip `optional` columns but not `required` ones.
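
A rough sketch of the relaxed rule described above, written against plain
Python types rather than PyIceberg or PyArrow schemas. `Field` and
`check_file_schema` are hypothetical names for this example only and are not
the signature of `_check_pyarrow_schema_compatible`. The assumed rule: extra
columns in the file are tolerated, a missing `optional` column is tolerated,
and a missing `required` column or a type mismatch is rejected.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Field:
    """Hypothetical table field: a name, a type name, and a required flag."""

    name: str
    type: str
    required: bool


def check_file_schema(table_fields: List[Field], file_columns: Dict[str, str]) -> List[str]:
    """Validate file columns against the table; return the extra file columns."""
    for field in table_fields:
        if field.name not in file_columns:
            if field.required:
                raise ValueError(f"Missing required column: {field.name}")
            continue  # a missing optional column is acceptable
        if file_columns[field.name] != field.type:
            raise ValueError(
                f"Type mismatch for {field.name}: "
                f"table has {field.type}, file has {file_columns[field.name]}"
            )
    table_names = {f.name for f in table_fields}
    # Columns only present in the file are reported, not rejected.
    return sorted(name for name in file_columns if name not in table_names)


table = [Field("id", "long", True), Field("note", "string", False)]
extra = check_file_schema(table, {"id": "long", "extra": "double"})
```

With this input the optional `note` column is absent and the call still
succeeds, returning `["extra"]`; dropping `id` from the file would raise a
`ValueError`.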
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]