kevinjqliu commented on issue #3168: URL: https://github.com/apache/iceberg-python/issues/3168#issuecomment-4100157344
Thanks for reporting this!

> The to_arrow_batch_reader does not help here either, because -as per my understanding- in the batch reader of pyiceberg each batch represents an individual datafile. Hence, if there is one problematic 6MB datafile, it makes no difference if you use the batch reader or not. I also have the impression that when you iterate over the reader, pyarrow has already loaded the parquet file in a separate thread and this is where the memory explosion actually happens.

This is a bug; I found out about it recently (see https://github.com/apache/iceberg-python/discussions/3122). `to_arrow_batch_reader` should read a single Parquet file one batch at a time, which reduces memory consumption. https://github.com/apache/iceberg-python/pull/2676 is the proper fix for this behavior. I'd love to hear whether it helps with your issue.

> There should be an option somewhere, e.g. in the data_scan to specify for which columns dictionary encoding should be used. This option should be forwarded to pyarrow internally somehow, so that pyarrow uses less memory.

I think that's a reasonable feature request. I've opened https://github.com/apache/iceberg-python/issues/3170 to track it.
