James Porritt created ARROW-1017:
------------------------------------
Summary: Python: Calling to_pandas on a Parquet file in HDFS leaks memory
Key: ARROW-1017
URL: https://issues.apache.org/jira/browse/ARROW-1017
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.3.0
Reporter: James Porritt
Running the following code results in ever-increasing memory usage, even though
I would expect the DataFrame to be garbage collected when it goes out of scope
at the end of each call. For my Parquet file, usage increases by about 3 GB per
loop iteration:
{code}
from pyarrow import HdfsClient


def read_parquet_file(client, parquet_file):
    # Read the file into an Arrow table, then convert it to a DataFrame.
    # Both objects go out of scope when the function returns.
    parquet = client.read_parquet(parquet_file)
    df = parquet.to_pandas()


client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
parquet_file = '/my/parquet/file'

while True:
    read_parquet_file(client, parquet_file)
{code}
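To put a number on the growth, a minimal sketch along the following lines could record the process's peak resident set size after each call. The helper name peak_rss_per_call is hypothetical and not part of pyarrow. It uses only the standard library and assumes a Unix system where the resource module is available; ru_maxrss is reported in kilobytes on Linux and in bytes on macOS. Because the value is a high-water mark, it never decreases, so a leak would show up as readings that keep climbing rather than levelling off.

```python
import gc
import resource


def peak_rss_per_call(func, iterations=5):
    """Call func repeatedly and record the peak RSS after each call.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    readings = []
    for _ in range(iterations):
        func()
        # Force a collection so that garbage merely awaiting the
        # collector is not mistaken for a leak.
        gc.collect()
        usage = resource.getrusage(resource.RUSAGE_SELF)
        readings.append(usage.ru_maxrss)
    return readings
```

With the client and path from the report, this might be called as peak_rss_per_call(lambda: read_parquet_file(client, parquet_file)). If the readings keep rising by a roughly constant amount even after the explicit gc.collect(), that would suggest the memory is held outside the reach of Python's garbage collector.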
Is there a reference count issue similar to ARROW-362?