syun64 commented on PR #807:
URL: https://github.com/apache/iceberg-python/pull/807#issuecomment-2163182848

   > To summarize the discussion in https://github.com/apache/iceberg-python/issues/791, we could always benefit from reading data as `large_*` types since the offsets are 64-bit.
   
   Yes, that's how I understand it too. There are benefits to using `large_*` types in memory, so we can decouple the motivation for holding data in memory as large types from the motivation for writing large types, even if the only file format PyIceberg currently supports doesn't support writing large data yet.
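   To illustrate the decoupling, here is a minimal sketch (not the PR's actual implementation; the helper name and file path are made up): data is read with whatever types the file yields, then cast to `large_*` types for the in-memory representation.

```python
import pyarrow as pa
import pyarrow.parquet as pq

def to_large_types(schema: pa.Schema) -> pa.Schema:
    """Map string/binary fields to their large_* counterparts."""
    fields = []
    for field in schema:
        if pa.types.is_string(field.type):
            field = field.with_type(pa.large_string())
        elif pa.types.is_binary(field.type):
            field = field.with_type(pa.large_binary())
        fields.append(field)
    return pa.schema(fields)

table = pq.read_table("data.parquet")             # hypothetical file path
table = table.cast(to_large_types(table.schema))  # 64-bit offsets in memory
```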
   
   > For Parquet, we will still write data in non-large types due to Parquet's 2 GB data size limitation.
   
   I think it actually won't matter either way, because we will get an error either when we try to downcast to the smaller type or when we try to write the Parquet file while the table holds genuinely large data. I've updated the `to_requested_schema` function to always cast to large types, even on write, for simplicity.
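   For reference, a hedged sketch of the casting behaviour described above (illustrative only, not code from the PR):

```python
import pyarrow as pa

small = pa.array([b"foo", b"bar"], type=pa.binary())

# Upcasting to 64-bit offsets is always safe.
as_large = small.cast(pa.large_binary())

# Downcasting succeeds here, but would fail if the column held more than
# 2 GiB of data, since binary() uses 32-bit offsets; the same overflow
# would surface when writing genuinely large values.
as_small = as_large.cast(pa.binary())
```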
   
   > Just to confirm my understanding, since the major difference between 
`large_binary` and `binary` is the offset type (64-bit versus 32-bit), there 
will be no significant increase in memory usage when reading data as 
`large_binary`.
   
   Yes, that's right. I've added a comment as you suggested 🙂
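   A quick way to see this (just an illustration, not part of the PR): the data buffers are identical, and only the offsets buffer grows from 4 to 8 bytes per value.

```python
import pyarrow as pa

values = [b"a" * 100] * 1_000
small = pa.array(values, type=pa.binary())
large = pa.array(values, type=pa.large_binary())

print(small.nbytes)  # data buffer + 32-bit offsets
print(large.nbytes)  # same data buffer + 64-bit offsets (~4 KB more here)
```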

