khjoshi94 commented on issue #15427: URL: https://github.com/apache/iceberg/issues/15427#issuecomment-3949421592
Thank you very much @RussellSpitzer for sharing your thoughts so quickly. Regarding your comment:

> Could you also explain how there is different behavior on the read side in different engines? I'm not sure I followed from your example

Sure, and apologies if my example was not clear enough.

To my understanding, Iceberg stores partition values exactly as they appear in the incoming row. If a value contains trailing whitespace (e.g., `"20240201 "`), that exact value is written into the manifest.

Spark uses strict string equality during partition pruning, so `"20240201"` does not match `"20240201 "` and the file is pruned. This results in empty reads or empty joins.

From what I found and understood, Athena's partition filtering behavior is not identical to Spark's, because it relies on AWS Glue's metadata and evaluation rules, which do not always apply strict byte-for-byte string equality. This could lead to Athena matching values that Spark would prune.

Based on these observations, the same Iceberg table may return different results depending on which engine queries it. Normalizing string partition values at write time would ensure that all engines see the same canonical value and apply the same pruning logic.

I hope this clarifies things, based on my limited understanding. Thanks.
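To make the pruning mismatch concrete, here is a minimal sketch in plain Python. It does not use Iceberg, Spark, or Athena APIs; `DataFile`, `prune_strict`, and `normalize` are hypothetical names made up for illustration, and stripping trailing whitespace is only one assumed form that write-time normalization could take.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    partition_value: str  # value recorded for the file's partition (illustrative only)


def prune_strict(files, wanted):
    """Keep only files whose partition value equals `wanted` exactly."""
    return [f for f in files if f.partition_value == wanted]


def normalize(value):
    """Hypothetical write-time normalization: strip trailing whitespace."""
    return value.rstrip()


# Incoming partition values; the first one carries a trailing space.
raw_rows = ["20240201 ", "20240201", "20240202"]

# Values stored exactly as they arrived versus normalized before writing.
as_written = [DataFile(f"data/{i}.parquet", v) for i, v in enumerate(raw_rows)]
normalized = [DataFile(f"data/{i}.parquet", normalize(v)) for i, v in enumerate(raw_rows)]

strict_hits = [f.path for f in prune_strict(as_written, "20240201")]
normalized_hits = [f.path for f in prune_strict(normalized, "20240201")]

print(strict_hits)      # ['data/1.parquet']
print(normalized_hits)  # ['data/0.parquet', 'data/1.parquet']
```

With the values stored as written, a strict-equality filter on `"20240201"` skips the file whose value has the trailing space. After normalization, both files survive the same filter, so any engine applying strict equality would see the same set of files.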
