Rauli Ruohonen created ARROW-8801:
-------------------------------------

             Summary: PyArrow leaks memory on read from Parquet file with UTC timestamps using pandas
                 Key: ARROW-8801
                 URL: https://issues.apache.org/jira/browse/ARROW-8801
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.17.0, 0.16.0
         Environment: Tested using pyarrow 0.17.0, pandas 1.0.3, Python 3.7.5, macOS Mojave.
Also tested using pyarrow 0.16.0, pandas 1.0.3, Python 3.8.2, Ubuntu 20.04 (Linux).
            Reporter: Rauli Ruohonen
Given the script dump.py

{code:java}
import numpy as np
import pandas as pd

# Write 2**20 random millisecond timestamps as a tz-aware (UTC) column.
x = pd.to_datetime(np.random.randint(0, 2**32, size=2**20), unit='ms', utc=True)
pd.DataFrame({'x': x}).to_parquet('data.parquet', engine='pyarrow', compression=None)
{code}

and the script load.py

{code:java}
import sys

import pandas as pd


def foo(engine):
    # Read the same file repeatedly; every result is discarded immediately.
    for _ in range(2**9):
        pd.read_parquet('data.parquet', engine=engine)
    print('Done')
    input()


foo(sys.argv[1])
{code}

running first "python dump.py" and then "python load.py pyarrow", Python's memory usage on my machine stays above 4 GB while the script waits for input. Running "python load.py fastparquet" instead, it stays around 100 MB, so this appears to be a pyarrow issue rather than a pandas issue. The leak disappears if "utc=True" is removed from dump.py, in which case the timestamps are timezone-naive.
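A variant that bypasses pd.read_parquet may help localize the leak. The sketch below is not part of the original reproduction: it reads the file directly with pyarrow.parquet and then checks pyarrow.total_allocated_bytes(), on the assumption that any retained memory is tracked by Arrow's default memory pool.

{code:java}
# localize.py - a sketch for narrowing the leak down below the pandas layer.
import pyarrow as pa
import pyarrow.parquet as pq

for _ in range(2**9):
    # Reading with read_table() alone vs. read_table().to_pandas() separates
    # the Parquet read from the Arrow-to-pandas timestamp conversion.
    pq.read_table('data.parquet').to_pandas()

# Bytes currently allocated by Arrow's default memory pool.
print('arrow pool bytes still allocated:', pa.total_allocated_bytes())
input()
{code}

If the reported pool size stays near zero while the process RSS stays high, the retained memory lives outside Arrow's pool (e.g. in pandas objects or the allocator); if it stays large, pyarrow itself is holding on to the buffers.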