anna-dv commented on issue #3260:
URL: 
https://github.com/apache/iceberg-python/issues/3260#issuecomment-4328520669

   Thanks a lot, @lukaswelsch, for opening the issue!
   
   Here's some additional context that might also be helpful:
   This issue becomes really problematic when working with Iceberg tables 
"living" in AWS Athena/Trino if these tables, or parts of them, are often 
optimized for performance through [Athena's OPTIMIZE 
query](https://docs.aws.amazon.com/athena/latest/ug/optimize-statement.html) 
(or the one [from 
Trino](https://trino.io/docs/current/connector/iceberg.html#optimize)).
   
   As it turns out, if the optimization procedure mentioned above finds 
Parquet files eligible for optimization, it rewrites them so that all 
`dict-encoded string` columns (if there were any) become `plain string` 
columns again. There also seems to be no way to influence this optimization 
behavior as of now (or at least, I'm not aware of one). So any table created 
using `dictionary_encode()` and continuously optimized for performance in 
Athena/Trino inevitably enters that mixed `dict-encoded string + plain string` 
state at some point, and the incremental scans start failing. 
   
   The interesting thing here, though, is that before seeing the error message 
from the issue above (`pyarrow.lib.ArrowTypeError: ...`), one also gets a 
warning from pyiceberg that [Iceberg does not have a dictionary type. <class 
'pyarrow.lib.DictionaryType'> will be inferred as string on 
read.](https://github.com/apache/iceberg-python/blob/8dee48a8e0218353f706133ed035334869a7ee12/pyiceberg/io/pyarrow.py#L1222),
 but it seems this inference never actually happens, and the read still fails 
afterwards.
   
   Looking forward to the updates on this one and would also be happy to help 
if needed! ✌️


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

