Fokko commented on issue #791: URL: https://github.com/apache/iceberg-python/issues/791#issuecomment-2159294140
I agree that you cannot write a single field of 2GB+ to a Parquet file; in that case, Parquet is probably not the best way of storing such a big blob. The difference is in how the offsets are stored: with `large_binary` the offsets are 64-bit integers, and with `binary` they are 32-bit integers. When we create an array in Arrow: `[foo, bar, arrow]`, it is stored as:

```python
data = 'foobararrow'
offsets = [0, 3, 6, 11]
```

If the offsets are 32 bits, then you need to chunk the data into smaller buffers, which negatively impacts performance.
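
A minimal sketch with PyArrow (assuming `pyarrow` and `numpy` are installed; this example is mine, not part of the original comment) that makes the offset-width difference visible:

```python
import numpy as np
import pyarrow as pa

values = [b"foo", b"bar", b"arrow"]

small = pa.array(values, type=pa.binary())        # 32-bit offsets
large = pa.array(values, type=pa.large_binary())  # 64-bit offsets

# For (large_)binary arrays the second buffer holds the offsets.
small_offsets = np.frombuffer(small.buffers()[1], dtype=np.int32, count=len(values) + 1)
large_offsets = np.frombuffer(large.buffers()[1], dtype=np.int64, count=len(values) + 1)

print(small_offsets)  # [ 0  3  6 11]
print(large_offsets)  # [ 0  3  6 11]
```

With 32-bit offsets the data buffer of a single array/chunk cannot exceed 2**31 - 1 bytes (~2 GiB), which is why larger payloads have to be split across multiple chunks.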
