kevinjqliu commented on PR #1457: URL: https://github.com/apache/iceberg-python/pull/1457#issuecomment-2560171094
Thanks for the PR! I've dug into the test failure a bit. Here's what I found.

There's a subtle difference between `PartitionKey.partition` and `DataFile.partition`. In most cases, these hold the same value. For strings with special characters, however, `DataFile.partition` is sanitized but `PartitionKey.partition` is not.

`DataFile.partition` is sanitized according to [apache/iceberg#10120](https://github.com/apache/iceberg/issues/10120); this is to match the column value stored in the underlying parquet file.
`PartitionKey.partition` [uses the value from the PartitionSpec, which stores the un-sanitized value](https://github.com/apache/iceberg-python/blob/b450c1c482a615cbb62cabe88ffaca04fb3f7376/pyiceberg/partitioning.py#L391-L395). You can verify this by looking up the table partition spec.

```
iceberg_table.metadata.spec()
iceberg_table.metadata.specs()
```

The integration test assumes that `PartitionKey.partition` and `DataFile.partition` hold the same value.

One possible solution is to sanitize the given `Record` before the comparison. After `spark_path_for_justification`, add:

```
# Special characters in partition value are sanitized when written to the data file's partition field
# Use `make_compatible_name` to match the sanitize behavior
sanitized_record = Record(**{make_compatible_name(k): v for k, v in vars(expected_partition_record).items()})
assert spark_partition_for_justification == sanitized_record
```
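To make the mismatch concrete, here is a minimal, self-contained sketch of the idea. It does not import pyiceberg: `sanitize_name` is a hypothetical stand-in for `make_compatible_name`, and it assumes the sanitizer replaces each character that is not a letter, digit, or underscore with `_x` followed by the character's hex code point. The field name and value are made up for illustration, and plain dicts stand in for `Record`.

```python
def sanitize_name(name: str) -> str:
    """Hypothetical stand-in for make_compatible_name (assumed behavior, not the real implementation)."""
    out = []
    for i, ch in enumerate(name):
        if ch == "_" or ch.isalpha() or (ch.isdigit() and i > 0):
            # Letters, underscores, and non-leading digits are kept as-is.
            out.append(ch)
        elif ch.isdigit():
            # Assumption: a leading digit is prefixed with an underscore.
            out.append("_" + ch)
        else:
            # Assumption: any other character becomes _x<HEX code point>.
            out.append("_x" + format(ord(ch), "X"))
    return "".join(out)


# Un-sanitized field name, as it would come from the partition spec (illustrative only).
expected_partition = {"special#string+field": "special string"}

# Sanitize only the keys (field names), mirroring the dict comprehension in the suggested fix.
sanitized_partition = {sanitize_name(k): v for k, v in expected_partition.items()}

print(sanitized_partition)
```

Under these assumptions only the field names change and the values are left untouched, which is why the suggested fix rebuilds the `Record` with sanitized keys before asserting equality.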
