Sergey Mozharov created ARROW-6874:
--------------------------------------
Summary: Memory leak in Table.to_pandas() when nested columns are
present
Key: ARROW-6874
URL: https://issues.apache.org/jira/browse/ARROW-6874
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.0
Environment: Operating system: Windows 10
pyarrow installed via conda
both python environments were identical except pyarrow:
python: 3.6.7
numpy: 1.17.2
pandas: 0.25.1
Reporter: Sergey Mozharov
After upgrading from pyarrow 0.14.1 to 0.15.0, my Python interpreter ran out of
memory during testing.
I narrowed the issue down to the pyarrow.Table.to_pandas() call, which appears
to leak memory in 0.15.0 when the table contains a nested (list) column. The
code below reproduces the issue.
{code:java}
import numpy as np
import pandas as pd
import pyarrow as pa

# create a table with one nested array column
nested_array = pa.array([np.random.rand(1000) for i in range(500)])
nested_array.type  # ListType(list<item: double>)
table = pa.Table.from_arrays(arrays=[nested_array], names=['my_arrays'])

# convert it to a pandas DataFrame in a loop to monitor memory consumption
num_iterations = 10000
# pyarrow v0.14.1: Memory allocation does not grow during loop execution
# pyarrow v0.15.0: ~550 MB is added to RAM, never garbage collected
for i in range(num_iterations):
    df = pa.Table.to_pandas(table)

# When the table column is not nested, no memory leak is observed
array = pa.array(np.random.rand(500 * 1000))
table = pa.Table.from_arrays(arrays=[array], names=['numbers'])
# no memory leak:
for i in range(num_iterations):
    df = pa.Table.to_pandas(table)
{code}
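To put a number on the growth instead of watching RAM by eye, the loop can be wrapped in a small measuring helper. This is only a sketch using the standard library: memory_growth and traced_bytes are hypothetical names, not part of pyarrow, and the default probe relies on tracemalloc, which only sees allocations made through allocators it tracks. Whether that probe would actually capture this particular leak has not been verified here.

```python
import gc
import tracemalloc


def traced_bytes():
    """Bytes currently allocated through allocators that tracemalloc tracks."""
    current, _peak = tracemalloc.get_traced_memory()
    return current


def memory_growth(fn, iterations=100, probe=traced_bytes):
    """Call fn() repeatedly and return the change in probe() across the loop.

    Hypothetical helper: `probe` is any zero-argument callable returning a
    byte count, so a different memory source can be swapped in.
    """
    fn()  # warm-up call so one-time caches are not counted as growth
    gc.collect()
    before = probe()
    for _ in range(iterations):
        fn()
    gc.collect()
    return probe() - before


tracemalloc.start()

# Deliberately leaking callable: every buffer is kept alive in a list.
kept = []
print(memory_growth(lambda: kept.append(bytearray(10000))))

# Non-leaking callable: each buffer is released right after the call.
print(memory_growth(lambda: bytearray(10000)))
```

For the reproduction above, fn would be lambda: table.to_pandas(). Memory held by Arrow's own pool is not visible to tracemalloc; passing probe=pa.total_allocated_bytes, or a process-level RSS probe, is the alternative in that case.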