[ https://issues.apache.org/jira/browse/ARROW-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney reassigned ARROW-8801:
-----------------------------------

    Assignee: Wes McKinney

> [Python] Memory leak on read from parquet file with UTC timestamps using
> pandas
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-8801
>                 URL: https://issues.apache.org/jira/browse/ARROW-8801
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0, 0.17.0
>         Environment: Tested using pyarrow 0.17.0, pandas 1.0.3, python 3.7.5,
> mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, python 3.8.2,
> ubuntu 20.04 (linux).
>            Reporter: Rauli Ruohonen
>            Assignee: Wes McKinney
>            Priority: Blocker
>             Fix For: 1.0.0
>
> Given the dump.py script
>
> {code:python}
> import pandas as pd
> import numpy as np
>
> x = pd.to_datetime(np.random.randint(0, 2**32, size=2**20), unit='ms',
>                    utc=True)
> pd.DataFrame({'x': x}).to_parquet('data.parquet', engine='pyarrow',
>                                   compression=None)
> {code}
>
> and the load.py script
>
> {code:python}
> import sys
> import pandas as pd
>
> def foo(engine):
>     for _ in range(2**9):
>         pd.read_parquet('data.parquet', engine=engine)
>     print('Done')
>     input()
>
> foo(sys.argv[1])
> {code}
>
> running first "python dump.py" and then "python load.py pyarrow", Python's
> memory usage on my machine stays at 4+ GB while the script waits for input.
> Running "python load.py fastparquet" instead, it is about 100 MB, so this
> appears to be a pyarrow issue rather than a pandas issue. The leak disappears
> if "utc=True" is removed from dump.py, in which case the timestamps are
> timezone-naive.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
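A note on quantifying the growth described above: the report eyeballs memory in a process monitor, but the same comparison can be made programmatically. The sketch below is not part of the original report; it is a minimal standard-library helper (using the `resource` module, POSIX-only) that could be dropped into load.py to print peak RSS before and after the read loop. The `max_rss_bytes` name and the Linux-vs-macOS unit handling are this sketch's own assumptions.

```python
import resource
import sys

def max_rss_bytes():
    # Peak resident set size of the current process. Note the unit differs
    # by platform: ru_maxrss is kilobytes on Linux but bytes on macOS.
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss if sys.platform == 'darwin' else rss * 1024

# Hypothetical usage inside load.py, around the read loop:
#     before = max_rss_bytes()
#     foo(sys.argv[1])          # the repeated pd.read_parquet() calls
#     after = max_rss_bytes()
#     print(f'peak RSS grew by {after - before} bytes')
# A leak of the kind reported shows up as growth proportional to the
# number of iterations; the fastparquet engine should stay near-flat.
```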