jonded94 opened a new issue, #44599: URL: https://github.com/apache/arrow/issues/44599
### Describe the bug, including details regarding any error messages, version, and platform.

I have a ~1.5TiB parquet dataset of ~1.7k files, plus an additional `_metadata.parquet` file containing the metadata of all row groups. The `_metadata` file was written with the mechanism described in the [documentation](https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files). The `_metadata` file is ~390MiB; the 1.7k parquet files are around 900MiB each.

The following script, `read_metadata.py`, iterates over the files in the dataset, reads their metadata, and measures memory usage (RSS) as it goes:

```python
import gc
import time
from contextlib import contextmanager
from pathlib import Path

import psutil
import pyarrow
import pyarrow.parquet

process = psutil.Process()


@contextmanager
def profiling(name: str):
    start = time.monotonic()
    start_mem = process.memory_info().rss / 1024**2
    yield
    end = time.monotonic()
    end_mem = process.memory_info().rss / 1024**2
    duration = end - start
    if end_mem - start_mem == 0:
        return
    print(
        f"{name}\n"
        f"    took {duration:.5f} s, "
        f"mem diff {end_mem - start_mem:.3f}MiB "
        f"[start: {start_mem:.3f}MiB, end: {end_mem:.3f}MiB]"
    )


def read_metadata(path: Path) -> None:
    pyarrow.parquet.read_metadata(path)


if __name__ == "__main__":
    import argparse
    import random

    parser = argparse.ArgumentParser()
    parser.add_argument("files", nargs="+")
    parser.add_argument("repeats", type=int)
    args = parser.parse_args()

    paths = args.files * args.repeats
    random.shuffle(paths)

    for path in paths:
        with profiling(f"load: {path}"):
            read_metadata(path)
        with profiling(f"gc: {path}"):
            gc.collect()
```

Running it gives these results (note that only steps where memory usage changes are printed):

```
$ python scripts/read_metadata.py repartition/* 3
load: repartition/part-52.parquet
    took 0.00132 s, mem diff 1.500MiB [start: 194.281MiB, end: 195.781MiB]
load: repartition/_metadata.parquet
    took 2.38347 s, mem diff 2082.062MiB [start: 195.781MiB, end: 2277.844MiB]
load: repartition/_metadata.parquet
    took 2.15518 s, mem diff 16.094MiB [start: 2277.844MiB, end: 2293.938MiB]
load: repartition/part-1587.parquet
    took 0.00099 s, mem diff 1.500MiB [start: 2293.938MiB, end: 2295.438MiB]
gc: repartition/part-1299.parquet
    took 0.00217 s, mem diff -1.832MiB [start: 2295.438MiB, end: 2293.605MiB]
load: repartition/_metadata.parquet
    took 2.15186 s, mem diff 0.562MiB [start: 2293.605MiB, end: 2294.168MiB]
gc: repartition/part-1112.parquet
    took 0.00220 s, mem diff -1.543MiB [start: 2294.168MiB, end: 2292.625MiB]
load: repartition/part-751.parquet
    took 0.00111 s, mem diff 1.500MiB [start: 2292.625MiB, end: 2294.125MiB]
load: repartition/part-321.parquet
    took 0.00113 s, mem diff 1.500MiB [start: 2294.125MiB, end: 2295.625MiB]
```

Memory usage stays mostly constant, but as soon as the `_metadata.parquet` file is read, it jumps by over 2GiB and never returns to normal, i.e. the memory appears to be leaked. Reading that particular file multiple times does *not* leak again each time. There seems to be no way to release the memory: neither `gc.collect()` nor `pool = pyarrow.default_memory_pool(); pool.release_unused()` helps.

### Component(s)

Parquet, Python
