Re: [PR] URL-encode partition field names in file locations [iceberg-python]

via GitHub Mon, 23 Dec 2024 11:02:12 -0800


kevinjqliu commented on PR #1457:
URL: https://github.com/apache/iceberg-python/pull/1457#issuecomment-2560171094


   Thanks for the PR! I've dug into the test failure a bit. Here's what I found. 
   
   There's a subtle difference between `PartitionKey.partition` and 
`DataFile.partition`. In most cases, these are the same value. For strings with 
special characters, `DataFile.partition` is sanitized but 
`PartitionKey.partition` is not. 
   
   `DataFile.partition` is sanitized according to 
[apache/iceberg/#10120](https://github.com/apache/iceberg/issues/10120) so that 
it matches the column value stored in the underlying Parquet file. 
   `PartitionKey.partition` [uses the value from the PartitionSpec which stores 
the un-sanitized 
value](https://github.com/apache/iceberg-python/blob/b450c1c482a615cbb62cabe88ffaca04fb3f7376/pyiceberg/partitioning.py#L391-L395).
 
   
   You can verify this by looking up the table partition spec. 
   ```
   iceberg_table.metadata.spec()
   iceberg_table.metadata.specs()
   ```
   
   The integration test assumes that the values of `PartitionKey.partition` and 
`DataFile.partition` are the same. 
   One possible solution is to sanitize the given `Record` before comparison.
   
   After `spark_path_for_justification`, 
   ```
           # Special characters in partition value are sanitized when written to the data file's partition field
           # Use `make_compatible_name` to match the sanitize behavior
           sanitized_record = Record(**{make_compatible_name(k): v for k, v in vars(expected_partition_record).items()})
           assert spark_partition_for_justification == sanitized_record
   ```
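
To make the mismatch concrete, here is a minimal, self-contained sketch that does not import pyiceberg. `sanitize_name` is a hypothetical stand-in for `make_compatible_name` (it only approximates that kind of escaping and is not the library's implementation), and the field name and records are made-up examples represented as plain dicts rather than `Record` objects.

```python
import re


def sanitize_name(name: str) -> str:
    """Hypothetical stand-in for `make_compatible_name`.

    Assumption: every character outside [A-Za-z0-9_] is replaced with
    `_x` followed by its uppercase hex code point.
    """

    def escape(match: re.Match) -> str:
        return "_x" + format(ord(match.group()), "X")

    return re.sub(r"[^A-Za-z0-9_]", escape, name)


# Un-sanitized field name, as the partition spec side would carry it (made-up example).
partition_key_record = {"special#string+field": "special string"}

# Sanitized field name, as the data file side would carry it (made-up example).
data_file_record = {"special_x23string_x2Bfield": "special string"}

# Comparing the two directly fails because the keys differ.
assert partition_key_record != data_file_record

# Sanitizing the keys first makes the comparison succeed; the values are untouched.
sanitized_record = {sanitize_name(k): v for k, v in partition_key_record.items()}
assert sanitized_record == data_file_record
```

The sketch only rewrites the keys, which mirrors the dict comprehension in the suggested test change above.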
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

