jonded94 opened a new issue, #44599:
URL: https://github.com/apache/arrow/issues/44599

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   I have a ~1.5TiB Parquet dataset of ~1.7k files, plus an additional 
`_metadata.parquet` file containing the metadata of all row groups. The `_metadata` 
file was written with the mechanism described in the 
[documentation](https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files).
   
   The `_metadata` file is ~390MiB, the 1.7k parquet files are around 900MiB 
each.
   
   I have a script called `read_metadata.py` which iterates through all files 
in the dataset, reads their metadata, and measures the memory load (RSS) at 
each step:
    ```python
   import gc
   import time
   from contextlib import contextmanager
   from pathlib import Path
   
   import psutil
   import pyarrow
   import pyarrow.parquet
   
   process = psutil.Process()
   
   @contextmanager
   def profiling(name: str):
       start = time.monotonic()
       start_mem = process.memory_info().rss / 1024**2
       yield
       end = time.monotonic()
       end_mem = process.memory_info().rss / 1024**2
       duration = end - start
       if end_mem - start_mem == 0:
           return
       print(
           f"{name}\n"
           f" took {duration:.5f} s, "
            f"mem diff {end_mem - start_mem:.3f}MiB [start: {start_mem:.3f}MiB, end: {end_mem:.3f}MiB]"
       )
   
   
   def read_metadata(path: Path) -> None:
       pyarrow.parquet.read_metadata(path)
       return
   
   
   if __name__ == "__main__":
       import argparse
       import random
   
       parser = argparse.ArgumentParser()
       parser.add_argument("files", nargs="+")
       parser.add_argument("repeats", type=int)
       args = parser.parse_args()
   
       paths = args.files
       repeats = args.repeats
   
       paths = paths * repeats
       random.shuffle(paths)
   
       for path in paths:
           with profiling(f"load: {path}"):
               read_metadata(path)
   
           with profiling(f"gc:   {path}"):
               gc.collect()
   ```
   
   Running the script gives the following results (note that only steps where 
the memory load changes are printed):
   ```
   $ python scripts/read_metadata.py repartition/* 3
   load: repartition/part-52.parquet
    took 0.00132 s, mem diff 1.500MiB [start: 194.281MiB, end: 195.781MiB]
   load: repartition/_metadata.parquet
    took 2.38347 s, mem diff 2082.062MiB [start: 195.781MiB, end: 2277.844MiB]
   load: repartition/_metadata.parquet
    took 2.15518 s, mem diff 16.094MiB [start: 2277.844MiB, end: 2293.938MiB]
   load: repartition/part-1587.parquet
    took 0.00099 s, mem diff 1.500MiB [start: 2293.938MiB, end: 2295.438MiB]
   gc:   repartition/part-1299.parquet
    took 0.00217 s, mem diff -1.832MiB [start: 2295.438MiB, end: 2293.605MiB]
   load: repartition/_metadata.parquet
    took 2.15186 s, mem diff 0.562MiB [start: 2293.605MiB, end: 2294.168MiB]
   gc:   repartition/part-1112.parquet
    took 0.00220 s, mem diff -1.543MiB [start: 2294.168MiB, end: 2292.625MiB]
   load: repartition/part-751.parquet
    took 0.00111 s, mem diff 1.500MiB [start: 2292.625MiB, end: 2294.125MiB]
   load: repartition/part-321.parquet
    took 0.00113 s, mem diff 1.500MiB [start: 2294.125MiB, end: 2295.625MiB]
   
   ```
   
![image](https://github.com/user-attachments/assets/dcece3ae-9d38-4e95-957b-b0d9f558f2bf)
   
   Memory load stays mostly constant, but as soon as the `_metadata.parquet` 
file is read, over 2GiB of memory is leaked. Reading that particular file 
multiple times does *not* leak additional memory.
   
   There seems to be no way to reduce the memory load back to normal levels: 
neither `pool = pyarrow.default_memory_pool(); pool.release_unused()` nor 
`gc.collect()` helps.
   
   ### Component(s)
   
   Parquet, Python

