Declow opened a new issue, #2325:
URL: https://github.com/apache/iceberg-python/issues/2325

   ### Apache Iceberg version
   
   None
   
   ### Please describe the bug 🐞
   
   There appears to be a memory leak in `pyiceberg/avro/reader.py`.
   I have a long-running service that keeps crashing. I tried to replicate the 
issue locally, and the local reproduction shows the same behavior.
   
   The following code creates an in-memory catalog and generates sample data 
for ingestion into Iceberg.
   
   
   ```
   import tracemalloc
   from datetime import datetime, timezone

   import polars as pl
   from pyiceberg.catalog.memory import InMemoryCatalog


   def generate_df():
       # Build 1000 rows of synthetic event data, including a struct column.
       return pl.DataFrame(
           {
               "event_type": ["playback"] * 1000,
               "event_origin": ["origin1"] * 1000,
               "event_send_at": [datetime.now(timezone.utc)] * 1000,
               "event_saved_at": [datetime.now(timezone.utc)] * 1000,
               "data": [
                   {
                       "calendarKey": "calendarKey",
                       "id": str(i),
                       "referenceId": f"ref-{i}",
                   }
                   for i in range(1000)
               ],
           }
       )


   # Create the catalog, namespace, and table once.
   catalog = InMemoryCatalog("default", warehouse="/tmp/iceberg")
   catalog.create_namespace("default")
   table = catalog.create_table(
       "default.leak",
       schema=generate_df().to_arrow().schema,
       location="/tmp/iceberg/leak",
   )

   tracemalloc.start()
   for i in range(1000):
       df = generate_df()
       df.write_iceberg(table, mode="append")
       # Print the ten largest allocation sites after every append.
       snapshot = tracemalloc.take_snapshot()
       top_stats = snapshot.statistics("lineno")
       for stat in top_stats[:10]:
           print(stat)
   ```
   
   Slowly but steadily, the memory attributed to the Avro reader in the output increases:
   
   > /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:330: size=370 KiB, count=3782, average=100 B
   > /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:190: size=222 KiB, count=1891, average=120 B
   > /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:133: size=184 KiB, count=5673, average=33 B
   
   After some more writes, the output looks like this:
   
   > /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:330: size=420 KiB, count=4290, average=100 B
   > /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:190: size=251 KiB, count=2145, average=120 B
   > /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:133: size=208 KiB, count=6435, average=33 B
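
To make the growth easier to see than by reading the full top-10 list after every write, two snapshots can be diffed and restricted to one file. The following is only a sketch using the standard-library `tracemalloc` API; the helper name `top_growth` and its parameters are made up for illustration and are not part of pyiceberg.

```python
import tracemalloc


def top_growth(before, after, filename_pattern="*", limit=10):
    """Return the allocation sites that changed most between two snapshots.

    `top_growth` is a hypothetical helper name; it only wraps
    tracemalloc.Snapshot.filter_traces and Snapshot.compare_to.
    """
    # Keep only traces whose most recent frame matches the pattern.
    only = [tracemalloc.Filter(True, filename_pattern)]
    old = before.filter_traces(only)
    new = after.filter_traces(only)
    # compare_to sorts by the absolute size difference, largest first.
    return new.compare_to(old, "lineno")[:limit]
```

In the reproduction above, one would take a snapshot before the loop and another every few hundred appends, then call `top_growth(before, after, "*/pyiceberg/avro/reader.py")` and print each entry's `size_diff` and `count_diff`.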
   
   If we take a look at the `AvroFile` class, it uses the `__enter__` and 
`__exit__` dunder methods. The `__enter__` method assigns the reader to an 
attribute on the instance, but it seems like the different reader objects stick 
around afterwards.
   
https://github.com/apache/iceberg-python/blob/main/pyiceberg/avro/file.py#L192
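
The general pattern suspected here can be sketched in isolation. This is not the pyiceberg code: `Reader`, `HoldingFile`, `ReleasingFile`, and `reader_survives_exit` are hypothetical names, and the sketch only shows that an object assigned in `__enter__` stays alive after the `with` block for as long as the instance itself is referenced, unless `__exit__` drops it.

```python
import gc
import weakref


class Reader:
    """Hypothetical stand-in for a reader object built when a file is opened."""


class HoldingFile:
    """Hypothetical context manager that keeps its reader after exit."""

    def __enter__(self):
        self.reader = Reader()
        return self

    def __exit__(self, exc_type, exc, tb):
        # Nothing is cleared, so the instance still references the reader.
        return None


class ReleasingFile(HoldingFile):
    """Same as above, but drops the reader when the block ends."""

    def __exit__(self, exc_type, exc, tb):
        self.reader = None
        return None


def reader_survives_exit(file_cls):
    """Report whether the reader is still alive after the with block."""
    f = file_cls()
    with f:
        ref = weakref.ref(f.reader)
    gc.collect()
    # f is still referenced here, which mimics an instance that is kept around.
    return ref() is not None
```

Whether this is what actually happens in `AvroFile` would need to be confirmed against the code linked above; the sketch only demonstrates the mechanism.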
   
   
   
   
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
