thomas-pfeiffer opened a new issue, #3168:
URL: https://github.com/apache/iceberg-python/issues/3168

   ### Apache Iceberg version
   
   0.11.0 (latest release)
   
   ### Please describe the bug 🐞
   
   We have a table with a column containing large JSON strings (think of large 
`pydantic` validation results). The resulting data file in Iceberg is roughly a 
6 MB (gzip-compressed) Parquet file, but when querying it, memory consumption 
spikes to ~4 GB. (Only for a few seconds, but long enough to cause 
out-of-memory issues on some systems.)
   
   The issue is that by default `pyarrow` materializes the string of every row 
separately in memory, which blows up memory usage. The behaviour can be 
reproduced by downloading the data file and opening it directly with `pyarrow`.
   
   There are two workarounds in `pyarrow`: 1. don't load the problematic column 
(if that is possible in your use case), and 2. switch to dictionary encoding 
for that column (example snippet below).
   
   ```py
   from pyarrow.parquet import read_table

   table = read_table("datafile.parquet", read_dictionary=["problematic_column"])
   ```
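   
   For illustration, here is a self-contained sketch of the effect (the file 
name, column name, and sizes are made up for the example):
   
   ```py
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.parquet as pq

   # A column repeating one large JSON-like string: the Parquet writer
   # dictionary-encodes it, so the file stays small, but the plain-decoded
   # Arrow column materializes a separate copy for every row.
   big_json = '{"result": "' + "x" * 100_000 + '"}'
   table = pa.table({"payload": [big_json] * 200})

   path = os.path.join(tempfile.mkdtemp(), "datafile.parquet")
   pq.write_table(table, path, compression="gzip")

   plain = pq.read_table(path)
   dict_encoded = pq.read_table(path, read_dictionary=["payload"])

   print(plain["payload"].nbytes)         # ~20 MB: one copy per row
   print(dict_encoded["payload"].nbytes)  # far smaller: shared dictionary values
   ```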
   
   Issue in `pyiceberg`:
   Regardless of whether you use `table.scan(...).to_arrow()` or 
`table.scan(...).to_arrow_batch_reader()`, `pyiceberg` currently has (as far as 
I know) no option to specify dictionary encoding for certain columns, so 
`pyarrow` uses the default encoding and memory usage explodes.
   
   `to_arrow_batch_reader` does not help here either, because (as far as I 
understand) each batch in `pyiceberg`'s batch reader corresponds to an 
individual data file. So if there is one problematic 6 MB data file, it makes 
no difference whether you use the batch reader or not. I also have the 
impression that by the time you iterate over the reader, `pyarrow` has already 
loaded the Parquet file in a separate thread, and that is where the memory 
explosion actually happens.
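   
   For comparison, at the `pyarrow` level a file can at least be streamed in 
batches of bounded row count while keeping the column dictionary-encoded; a 
sketch (file and column names made up):
   
   ```py
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.parquet as pq

   # Write a small sample file as a stand-in for the real data file.
   path = os.path.join(tempfile.mkdtemp(), "datafile.parquet")
   pq.write_table(pa.table({"payload": ["x" * 1000] * 500}), path)

   # ParquetFile.iter_batches caps the rows per emitted batch; combined with
   # read_dictionary, the large column stays dictionary-encoded. (pyarrow
   # still reads a full row group at a time internally.)
   pf = pq.ParquetFile(path, read_dictionary=["payload"])
   total = 0
   for batch in pf.iter_batches(batch_size=64):
       total += batch.num_rows  # each batch holds at most 64 rows
   print(total)  # 500
   ```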
   
   So currently the only workaround in `pyiceberg` is option 1: don't load the 
problematic column, by specifying `selected_fields`:
   
   ```py
   from pyiceberg.catalog import load_catalog

   catalog = load_catalog("default")
   table = catalog.load_table("your_table")
   reader = table.scan(selected_fields=("all", "other", "columns")).to_arrow_batch_reader()
   for batch in reader:
       ...
   ```
   
   Expected behaviour:
   There should be an option somewhere, e.g. on the data scan, to specify which 
columns should be read dictionary-encoded. This option should be forwarded 
internally to `pyarrow` so that `pyarrow` uses less memory.
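   
   A sketch of what this could look like (`read_dictionary` is a hypothetical 
scan parameter that does not exist in `pyiceberg` today; the name simply 
mirrors the `pyarrow` option it would be forwarded to):
   
   ```py
   # Hypothetical API -- `read_dictionary` is not a current pyiceberg parameter.
   reader = table.scan(
       selected_fields=("problematic_column", "other_column"),
       read_dictionary=["problematic_column"],  # forwarded to pyarrow reads
   ).to_arrow_batch_reader()
   ```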
   
   Remark:
   I would not change the default behaviour; it would just be good to have the 
option to configure the `pyarrow` encoding when needed.
   
   This issue is a follow-up to 
https://github.com/apache/iceberg-python/issues/1205
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [ ] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [x] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

