[ https://issues.apache.org/jira/browse/ARROW-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jorge updated ARROW-5302:
-------------------------

Description:

The following piece of code (running on Linux, Python 3.6 from Anaconda) demonstrates a memory leak when reading data from disk.

{code:python}
import resource

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# some random data, some of it as array columns
path = 'data.parquet'
batches = 5000
df = pd.DataFrame({
    'a': ['AA%d' % i for i in range(batches)],
    't': [list(range(0, 180 * 60, 5))] * batches,
    'v': list(np.random.normal(10, 0.1, size=(batches, 180 * 60 // 5))),
    'u': ['t'] * batches,
})
pq.write_table(pa.Table.from_pandas(df), path)

# read the data above and convert it to json (e.g. the backend of a restful API)
for i in range(100):
    # comment out either of the two lines below and the leak vanishes
    df = pq.read_table(path).to_pandas()
    df.to_json()
    # ru_maxrss is the peak RSS of the process (in kilobytes on Linux)
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
{code}

Result:

{code}
785560
1065460
1383532
1607676
1924820
...
{code}

Relevant pip freeze:
pyarrow (0.13.0)
pandas (0.24.2)

was:
The previous description was identical, except that the loop called {{df.to_json(orient='records')}} instead of {{df.to_json()}}.

> Memory leak when read_table().to_pandas().to_json()
> ---------------------------------------------------
>
>                 Key: ARROW-5302
>                 URL: https://issues.apache.org/jira/browse/ARROW-5302
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>         Environment: Linux, Python 3.6.4 :: Anaconda, Inc.
>            Reporter: Jorge
>            Priority: Major
>              Labels: memory-leak
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
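
To narrow down whether the retained memory is held by Arrow's allocator or by Python-side objects, a diagnostic along these lines can help. This is a minimal sketch, not part of the original report: it reuses the {{data.parquet}} file written by the reproduction above, and relies on {{pa.total_allocated_bytes()}}, pyarrow's counter for bytes currently held by its default memory pool.

{code:python}
import gc
import resource

import pyarrow as pa
import pyarrow.parquet as pq

path = 'data.parquet'  # file written by the reproduction above

for i in range(10):
    df = pq.read_table(path).to_pandas()
    df.to_json()
    del df
    gc.collect()  # rule out reference cycles keeping DataFrames alive
    # peak RSS only ever grows; the Arrow counter shows live Arrow buffers
    print('peak rss (KB):', resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,
          '| arrow allocated (bytes):', pa.total_allocated_bytes())
{code}

If the Arrow counter returns to a stable baseline after each iteration while peak RSS keeps climbing, the growth is happening on the Python/pandas side (or in the allocator's retained free lists) rather than in unreleased Arrow buffers.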