Declow opened a new issue, #2325: URL: https://github.com/apache/iceberg-python/issues/2325
### Apache Iceberg version

None

### Please describe the bug 🐞

There appears to be a memory leak in `pyiceberg/avro/reader.py`. I have a long-running service that keeps crashing, and I was able to reproduce the issue locally.

The following code creates an in-memory catalog and generates some sample data to ingest into Iceberg.

```
from pyiceberg.catalog.memory import InMemoryCatalog
import tracemalloc
from datetime import datetime, timezone
import polars as pl


def generate_df():
    # Build 1000 rows of sample events with a nested struct column.
    df = pl.DataFrame(
        {
            "event_type": ["playback"] * 1000,
            "event_origin": ["origin1"] * 1000,
            "event_send_at": [datetime.now(timezone.utc)] * 1000,
            "event_saved_at": [datetime.now(timezone.utc)] * 1000,
            "data": [
                {
                    "calendarKey": "calendarKey",
                    "id": str(i),
                    "referenceId": f"ref-{i}",
                }
                for i in range(1000)
            ],
        }
    )
    return df


df = generate_df()
catalog = InMemoryCatalog("default", warehouse="/tmp/iceberg")
catalog.create_namespace("default")
table = iceberg_table = catalog.create_table(
    "default.leak", schema=df.to_arrow().schema, location="/tmp/iceberg/leak"
)

df = pl.DataFrame()
tracemalloc.start()
for i in range(1000):
    df = generate_df()
    df.write_iceberg(table, mode="append")
    # Print the ten largest allocation sites after every append.
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics("lineno")
    for stat in top_stats[:10]:
        print(stat)
```

Slowly but steadily, the memory that tracemalloc attributes to the Avro reader increases:

> /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:330: size=370 KiB, count=3782, average=100 B
> /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:190: size=222 KiB, count=1891, average=120 B
> /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:133: size=184 KiB, count=5673, average=33 B

After some more writes, the output looks like this:

> /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:330: size=420 KiB, count=4290, average=100 B
> /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:190: size=251 KiB, count=2145, average=120 B
> /Users/dits/git/play-recommendation-input-consumer/.venv/lib/python3.11/site-packages/pyiceberg/avro/reader.py:133: size=208 KiB, count=6435, average=33 B

The `AvroFile` class implements the `__enter__` and `__exit__` dunder methods. `__enter__` assigns the reader to an attribute on the instance, but it seems that the various reader objects stick around afterwards.

https://github.com/apache/iceberg-python/blob/main/pyiceberg/avro/file.py#L192

### Willingness to contribute

- [ ] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
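
To make the growth reported above easier to quantify, one option is to diff two tracemalloc snapshots and keep only the allocation sites in the Avro reader. The following is a minimal sketch that uses only the standard library; the helper name `reader_growth` and its default file pattern are assumptions for illustration, not part of pyiceberg.

```python
import tracemalloc


def reader_growth(before, after, pattern="*/pyiceberg/avro/reader.py", limit=10):
    """Return the allocation sites matching `pattern` that grew between two snapshots.

    `before` and `after` are tracemalloc.Snapshot objects. The result is a list
    of tracemalloc.StatisticDiff entries with a positive size_diff, largest
    differences first (the ordering comes from Snapshot.compare_to).
    """
    # Inclusive filter: keep only traces whose filename matches the pattern.
    only_matching = [tracemalloc.Filter(True, pattern)]
    diff = after.filter_traces(only_matching).compare_to(
        before.filter_traces(only_matching), "lineno"
    )
    return [stat for stat in diff if stat.size_diff > 0][:limit]
```

In the reproduction script this could be used by taking one snapshot after a warm-up append and another after a few hundred more appends, then printing each entry returned by `reader_growth(before, after)`. Calling `gc.collect()` right before each snapshot helps separate objects that are genuinely retained from reference cycles that simply have not been collected yet.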