khjoshi94 commented on issue #15427:
URL: https://github.com/apache/iceberg/issues/15427#issuecomment-3954515986

   @RussellSpitzer 
   Thanks for your response. Please let me restate the point more precisely, 
since my earlier example may have introduced some confusion. The earlier 
example was meant to illustrate cross‑engine differences in behavior, not to 
suggest that Athena treats "foo" and "foo " as equal. All engines, including 
Spark and Athena, apply strict string equality during partition pruning, so 
"foo" does not match "foo " anywhere.
   
   The underlying issue I’m trying to highlight is that Iceberg writes string 
partition values exactly as they appear in the incoming row. If a value 
contains trailing whitespace, that exact value is stored in the manifest. 
Because all engines use strict equality for partition pruning, a user filter 
like `batch_date = '20240201'` will not match a stored value like `'20240201 '` 
even though the data logically belongs to that partition.
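   
   To make the mismatch concrete, here is a minimal sketch in plain Python. It 
is not Iceberg code: `prune` and `stored_partitions` are hypothetical stand-ins 
for an engine's pruning step and the partition values recorded in manifests.

```python
# Hypothetical stand-in for string partition values recorded in manifests.
# The first value carries a trailing space from the incoming row.
stored_partitions = ["20240201 ", "20240202"]


def prune(partitions, wanted):
    """Keep only partitions whose value is strictly equal to the filter literal."""
    return [p for p in partitions if p == wanted]


# Equivalent of the user filter batch_date = '20240201'.
matched = prune(stored_partitions, "20240201")
print(matched)  # [] because '20240201 ' != '20240201'
```

   The partition holding the `'20240201 '` rows is skipped without any error, 
which is what makes the problem hard to notice.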
   
   I also appreciate your earlier point about the Iceberg community's decision 
not to sanitize user input, and I agree with that perspective. My view here is 
that normalizing string partition values at write time helps ensure that strict 
equality behaves consistently and avoids silent mis‑pruning across all engines. 
From that angle, this change still provides some value by preventing a 
non-obvious correctness issue that can be tricky to troubleshoot.
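   
   As a rough illustration of what write-time normalization could look like, 
here is a sketch in plain Python. It assumes that "normalizing" means trimming 
trailing whitespace only; `normalize_partition_value` is a hypothetical helper 
and not an Iceberg API.

```python
def normalize_partition_value(value):
    """Trim trailing whitespace from string values; leave other types untouched."""
    return value.rstrip() if isinstance(value, str) else value


# Hypothetical incoming partition values, two of them with trailing whitespace.
incoming = ["20240201 ", "20240201", "20240202\t"]

# After normalization the variants collapse into one partition value each.
normalized = sorted({normalize_partition_value(v) for v in incoming})
print(normalized)  # ['20240201', '20240202']
```

   With values stored this way, a strict-equality filter on `'20240201'` would 
match every row that logically belongs to that partition.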
   
   Thanks again for your prompt responses.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

