kevinjqliu commented on issue #3168: URL: https://github.com/apache/iceberg-python/issues/3168#issuecomment-4100157344
Thanks for reporting this!

> The to_arrow_batch_reader does not help here either, because -as per my understanding- in the batch reader of pyiceberg each batch represents an individual datafile. Hence, if there is one problematic 6MB datafile, it makes no difference if you use the batch reader or not. I also have the impression that when you iterate over the reader, pyarrow has already loaded the parquet file in a separate thread and this is where the memory explosion actually happens.

This is a bug; I found out about it recently (see https://github.com/apache/iceberg-python/discussions/3122). `to_arrow_batch_reader` should read a single Parquet file one batch at a time, which reduces memory consumption. https://github.com/apache/iceberg-python/pull/2676 is the proper fix for this behavior. I'd love to hear whether it helps with your issue.

> There should be an option somewhere, e.g. in the data_scan to specify for which columns dictionary encoding should be used. This option should be forwarded to pyarrow internally somehow, so that pyarrow uses less memory.

I think that's a reasonable feature request. I've opened https://github.com/apache/iceberg-python/issues/3170 to track it.
