syun64 commented on PR #807: URL: https://github.com/apache/iceberg-python/pull/807#issuecomment-2163182848

> To summarize the discussion in https://github.com/apache/iceberg-python/issues/791, we could always benefit from reading data as `large_*` type since offset is 64-bit.

Yes, that's how I understand it too. There are benefits to using `large_*` types in memory, so we can decouple the motivation for storing data in memory as large types from that of writing large types, even if the only file format PyIceberg currently supports doesn't allow writing large data yet.

> For parquet, we will still write data in non large type due to parquet's 2GB data size limitation.

I think it actually won't matter either way, because we will get an error either when we try to downcast to the smaller type or when we try to write the Parquet file while the table actually holds large data. I've updated the `to_requested_schema` function to always cast to large types, even on write, for simplicity.

> Just to confirm my understanding, since the major difference between `large_binary` and `binary` is the offset type (64-bit versus 32-bit), there will be no significant increase in memory usage when reading data as `large_binary`.

Yes, that's right. I've added a comment as you suggested 🙂
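For anyone curious about the memory point above, here is a minimal PyArrow sketch (not PyIceberg's actual `to_requested_schema` logic; the table and column names are made up) illustrating that casting `binary` to `large_binary` only widens the 32-bit offsets to 64-bit, so the value buffers themselves don't grow:

```python
import pyarrow as pa

# Hypothetical illustration, not PyIceberg code: a small table with a binary column.
table = pa.table({"blob": pa.array([b"foo", b"bar", None], type=pa.binary())})

# Cast the column to its large_* counterpart; only the offsets widen from int32 to int64.
large_table = table.cast(pa.schema([pa.field("blob", pa.large_binary())]))

print(table.schema)        # blob: binary
print(large_table.schema)  # blob: large_binary
# The size difference is just the wider offset buffer, not a copy of the values.
print(table.nbytes, large_table.nbytes)
```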
> To summarize the discussion in https://github.com/apache/iceberg-python/issues/791, we could always benefit from reading data as `large_*` type since offset is 64-bit. Yes, that's how I understand it too. There are benefits to using `large_*` types in memory, so we can decouple the motivation for storing data in memory as large types from that of writing large types, even if our only supported (in PyIceberg) file type doesn't support writing large data yet. > For parquet, we will still write data in non large type due to parquet's 2GB data size limitation. I think it actually won't matter either way because we will get an error when we either try to down cast the type to the smaller type, or try to write the parquet file when we have actually large data in the table. I've updated to_requested_schema function to always cast to large types even on write for simplicity. > Just to confirm my understanding, since the major difference between `large_binary` and `binary` is the offset type (64-bit versus 32-bit), there will be no significant increase in memory usage when reading data as `large_binary`. Yes that's right. I've added a comment as you've suggested 🙂 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org