anna-dv commented on issue #3260: URL: https://github.com/apache/iceberg-python/issues/3260#issuecomment-4328520669
Thanks a lot, @lukaswelsch, for opening the issue! Here's some additional context that might also be helpful:

This issue becomes really problematic when working with Iceberg tables "living" in AWS Athena/Trino, if these tables (or parts of them) are regularly optimized for performance through [Athena's OPTIMIZE query](https://docs.aws.amazon.com/athena/latest/ug/optimize-statement.html) (or the one [from Trino](https://trino.io/docs/current/connector/iceberg.html#optimize)). As it turns out, when that optimization procedure finds Parquet files eligible for optimization, it rewrites them in such a way that all `dict-encoded string` columns (if there were any) become `plain string` columns again. There also seems to be no way to influence this behavior as of now (or at least I'm not aware of one). So any table created using `dictionary_encode()` and continuously optimized for performance in Athena/Trino inevitably ends up in that mixed `dict-encoded string + plain string` state at some point, and the incremental scans start failing.

The interesting thing, though, is that before seeing the error message from the issue above (`pyarrow.lib.ArrowTypeError: ...`), one also gets a warning from pyiceberg: [Iceberg does not have a dictionary type. <class 'pyarrow.lib.DictionaryType'> will be inferred as string on read.](https://github.com/apache/iceberg-python/blob/8dee48a8e0218353f706133ed035334869a7ee12/pyiceberg/io/pyarrow.py#L1222) It seems, however, that this inference never actually happens, and the read still fails afterwards.

Looking forward to updates on this one, and I'd also be happy to help if needed! ✌️
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
