khjoshi94 commented on issue #15427: URL: https://github.com/apache/iceberg/issues/15427#issuecomment-3957233472
@RussellSpitzer Apologies for the late response, and thanks for the follow-up. Let me clarify the engine point, since I think my earlier comment may have been interpreted differently than intended.

To answer your question:

> I still don't see an example of two engines treating this differently. Do you have an example of an engine which automatically trims all string values?

I do not have an example of an engine that automatically trims string partition values. Spark and Athena both apply strict string equality during partition pruning, so `"20240201"` does not match `"20240201 "` in either of them. My earlier point wasn't meant to suggest otherwise. It was meant to highlight differences in **observed** behavior when users encounter this issue across engines, not differences in the equality semantics themselves.

The core issue I was trying to address is that Iceberg writes string partition values exactly as they appear in the incoming row. If a value contains trailing whitespace, that exact value is stored in the manifest. Because these engines use strict equality, a user filter like:

```sql
WHERE batch_date = '20240201'
```

will not match a stored value like `"20240201 "`, even though, from the user's perspective, the data "belongs" to that logical partition. The partition is silently pruned and the read comes back empty, which can be difficult to diagnose.

I fully understand and respect the concern about Iceberg not sanitizing or modifying user input. If normalizing string partition values at write time isn't aligned with the project's philosophy, I don't want to push that approach further.

That said, I would like to propose an opt-in table property. A property like:

```
write.partition.string-normalization=trim
```

would apply normalization only when the table owner explicitly enables it.
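
To make the failure mode and the proposed opt-in concrete, here is a minimal Python sketch. It is a toy model, not Iceberg code: the property name is the one proposed above (it does not exist in Iceberg today), `partition_value_for_write` and `prune` are hypothetical helpers, and I am assuming `trim` would strip whitespace from both ends of the value.

```python
from typing import Dict, List

# Hypothetical: the property proposed in this comment, not an existing Iceberg table property.
NORMALIZATION_PROPERTY = "write.partition.string-normalization"


def partition_value_for_write(value: str, table_properties: Dict[str, str]) -> str:
    """Return the string partition value a writer would store (toy model)."""
    if table_properties.get(NORMALIZATION_PROPERTY) == "trim":
        # Assumption: "trim" removes leading and trailing whitespace.
        return value.strip()
    # Default: the value is stored exactly as it appears in the incoming row.
    return value


def prune(stored_values: List[str], literal: str) -> List[str]:
    """Keep only partitions whose stored value strictly equals the filter literal."""
    return [v for v in stored_values if v == literal]


incoming = ["20240201 ", "20240202"]  # the first value has a trailing space

default_manifest = [partition_value_for_write(v, {}) for v in incoming]
opt_in_manifest = [
    partition_value_for_write(v, {NORMALIZATION_PROPERTY: "trim"}) for v in incoming
]

# Equivalent of: WHERE batch_date = '20240201'
default_hits = prune(default_manifest, "20240201")
opt_in_hits = prune(opt_in_manifest, "20240201")

print(default_hits)  # []
print(opt_in_hits)   # ['20240201']
```

With no property set, the stored value keeps its trailing space and the filter matches nothing. With the opt-in enabled, the same rows land in the partition the filter expects. The equality check itself is identical in both cases.
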
An opt-in property like this keeps Iceberg's default behavior unchanged, avoids implicit sanitization, and preserves backward compatibility, while still giving users a way to avoid silent mis-pruning if they choose.

If even an opt-in property isn't appropriate, I'd be happy to focus instead on documentation improvements and guidance for engine-level writer-side transforms. Happy to follow whichever direction the maintainers and community feel is most appropriate.

Thanks again.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
