khjoshi94 commented on issue #15427:
URL: https://github.com/apache/iceberg/issues/15427#issuecomment-3957233472

   @RussellSpitzer Apologies for the late response.
   
   Thanks for the follow‑up. Let me clarify the engine point, since I think my 
earlier comment may have been interpreted differently than intended.
   
   To answer your question:
   > I still don't see an example of two engines treating this differently. Do 
you have an example of an engine which automatically trims all string values?
   
   I do not have an example of an engine that automatically trims string 
partition values. Spark and Athena both apply strict string equality during 
partition pruning, so `"20240201"` does not match `"20240201 "` in either of 
them. My earlier point wasn’t meant to suggest otherwise. It was meant to 
highlight differences in the behavior users **observe** when they hit this 
issue across engines, not differences in the equality semantics themselves.
   
   The core issue I was trying to address is that Iceberg writes string 
partition values exactly as they appear in the incoming row. If a value 
contains trailing whitespace, that exact value is stored in the manifest. 
Because all engines use strict equality, a user filter like:
   
   ```sql
   WHERE batch_date = '20240201'
   ```
   will not match a stored value like `"20240201 "`, even though, from the 
user’s perspective, the data “belongs” to that logical partition. This leads to 
silent mis‑pruning and empty reads, which can be difficult to diagnose.
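
To make the failure mode concrete, here is a minimal sketch in plain Python. It is not Iceberg or engine code: `manifest_partitions` and `prune` are hypothetical names that only model strict string equality during partition pruning, and the file names are made up for illustration.

```python
# Illustrative model only: plain Python, not Iceberg or engine code.
# Partition values as they might be recorded in manifests, mapped to data files.
manifest_partitions = {
    "20240201 ": ["data-0001.parquet"],  # value written with a trailing space
    "20240202": ["data-0002.parquet"],
}


def prune(partitions, wanted):
    """Return the files whose partition value is strictly equal to `wanted`."""
    return [
        path
        for value, paths in partitions.items()
        if value == wanted  # strict equality: no trimming, no normalization
        for path in paths
    ]


files_exact = prune(manifest_partitions, "20240201")    # what the user typed
files_padded = prune(manifest_partitions, "20240201 ")  # what was stored

print(files_exact)
print(files_padded)
```

In this model the user's filter selects no files, while the padded literal selects the file that was written, which is the "empty read" described above.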
   
   I fully understand and respect the concern about Iceberg not sanitizing or 
modifying user input. If normalizing string partition values at write time 
isn’t aligned with the project’s philosophy, I don’t want to push that approach 
further.
   
   That said, I would like to propose an opt‑in table property instead. A 
property like:
   ```properties
   write.partition.string-normalization=trim
   ```
   would only apply normalization when the table owner explicitly enables it. 
This keeps Iceberg’s default behavior unchanged, avoids implicit sanitization, 
and preserves backward compatibility, while still giving users a way to avoid 
silent mis‑pruning if they choose.
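
As a rough sketch of the opt-in semantics I have in mind, again in plain Python rather than Iceberg code: `write.partition.string-normalization` is only the property name proposed in this comment and does not exist in Iceberg today, `normalize_partition_value` is a hypothetical helper, and I am assuming `trim` means removing both leading and trailing whitespace.

```python
# Sketch of the proposed opt-in behavior. The property name is the one proposed
# in this comment; it is NOT an existing Iceberg table property.
PROPERTY = "write.partition.string-normalization"


def normalize_partition_value(value, table_properties):
    """Return the partition value to store, given the table's properties.

    Without the property the value is returned untouched, so the default
    behavior stays exactly as it is today.
    """
    mode = table_properties.get(PROPERTY)
    if mode is None or not isinstance(value, str):
        return value  # default: no sanitization of user input
    if mode == "trim":
        # Assumption: "trim" strips leading and trailing whitespace.
        return value.strip()
    raise ValueError(f"unsupported value for {PROPERTY}: {mode!r}")


default_value = normalize_partition_value("20240201 ", {})
trimmed_value = normalize_partition_value("20240201 ", {PROPERTY: "trim"})

print(repr(default_value))
print(repr(trimmed_value))
```

The point of the design is that a table without the property behaves exactly as before, and only a table owner who sets it gets normalized values.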
   
   If even an opt‑in property isn’t appropriate, I’d be happy to focus instead 
on documentation improvements and guidance for engine‑level writer‑side 
transforms.
   
   Happy to follow whichever direction the maintainers and community feel is 
most appropriate. Thanks again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

