Tommo56700 commented on issue #2407:
URL: https://github.com/apache/iceberg-python/issues/2407#issuecomment-3836489988

   `to_arrow_batch_reader()` is quite misleading in how it is documented:
   
   > For large results, using a RecordBatchReader requires less memory than 
loading an Arrow Table for the same DataScan, because a RecordBatch is read one 
at a time.
   
   
https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L1759-L1803
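
   For reference, this is roughly the pattern I am running. A minimal sketch,
with placeholder catalog/table names and a trivial stand-in for my real
per-batch processing:

```python
from pyiceberg.catalog import load_catalog

# Placeholder names: substitute your own catalog and table identifier.
catalog = load_catalog("my_catalog")
table = catalog.load_table("my_db.my_table")

# Supposedly a low-memory, streaming interface.
reader = table.scan().to_arrow_batch_reader()

# Memory jumps to ~40GB as soon as the first batch is pulled from the
# reader, rather than growing gradually as batches are consumed.
for batch in reader:
    print(batch.num_rows)  # stand-in for downstream work
```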
   
   Internally, `to_record_batches()` is only invoked once the first record
batch is requested from the reader, and at that point I see memory spike to
40GB. Looking at the implementation, `def batches_for_task(task:
FileScanTask) -> list[pa.RecordBatch]:` eagerly materialises the full list of
record batches for each file scan task, and these tasks run in parallel across
however many PyIceberg workers are configured (defaulting to the number of
cores). In practice the worker count therefore corresponds directly to the
number of data files that are open and buffered at once (terrible design). I
set `export PYICEBERG_MAX_WORKERS=2` and my maximum memory decreased
dramatically, while my duration to process all record batches was unchanged.
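
   For anyone else hitting this, the workaround is simply capping the worker
pool. A sketch of what I did; note the variable appears to need to be set
before the first scan runs, since the worker pool is created once and then
reused, so setting it in the shell before launching Python is the safest
option:

```python
import os

# Cap the number of parallel file-scan tasks, and therefore the number of
# data files being read and buffered simultaneously. Set this before any
# scan executes; `export PYICEBERG_MAX_WORKERS=2` in the shell works too.
os.environ["PYICEBERG_MAX_WORKERS"] = "2"
```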
   
   This is a very poor and misleading design for an interface that is
supposed to help you reduce memory usage.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

