[ https://issues.apache.org/jira/browse/ARROW-11007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279733#comment-17279733 ]

Dmitry Kashtanov commented on ARROW-11007:
------------------------------------------

 

> That doesn't really answer the question: what does it measure? RSS? Virtual
> memory size?

It looks like `memory_profiler` uses the first item of the tuple returned by `psutil.Process().memory_info()`, which is `rss`.
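(For reference, a quick check of what that first field is; `psutil` reports sizes in bytes, so the conversion to MiB below is only for readability:)
{code:python}
import psutil

mem = psutil.Process().memory_info()

# memory_info() returns a named tuple whose first field is rss
# (resident set size): physical memory currently held by the process.
assert mem[0] == mem.rss
print(f"RSS: {mem.rss / 2**20:.1f} MiB")
{code}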

 

> Can you run "bqs_stream_to_pandas" in a loop and see whether memory usage
> increases? Or does it stay stable at its initial peak value?

Please see below: it barely increases.
{code:java}
Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
...
   117   2866.0 MiB   2713.1 MiB           1       dataset = bqs_stream_to_pandas(session, stream_name)
   118   2865.6 MiB     -0.4 MiB           1       del dataset
   119   2874.6 MiB      9.0 MiB           1       dataset = bqs_stream_to_pandas(session, stream_name)
   120   2874.6 MiB      0.0 MiB           1       del dataset
   121   2887.0 MiB     12.4 MiB           1       dataset = bqs_stream_to_pandas(session, stream_name)
   122   2878.2 MiB     -8.8 MiB           1       del dataset
   123   2903.2 MiB     25.1 MiB           1       dataset = bqs_stream_to_pandas(session, stream_name)
   124   2903.2 MiB      0.0 MiB           1       del dataset
   125   2899.2 MiB     -4.1 MiB           1       dataset = bqs_stream_to_pandas(session, stream_name)
   126   2899.2 MiB      0.0 MiB           1       del dataset
   127   2887.9 MiB    -11.3 MiB           1       dataset = bqs_stream_to_pandas(session, stream_name)
   128   2887.9 MiB      0.0 MiB           1       del dataset
{code}
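(For reference, this is roughly the driver that was profiled; a sketch assuming `bqs_stream_to_pandas`, `session`, and `stream_name` are defined as earlier in this thread:)
{code:python}
from memory_profiler import profile

@profile
def rerun_stream(session, stream_name):
    # The calls are written out rather than looped so that memory_profiler
    # reports each materialization on its own source line, as above.
    dataset = bqs_stream_to_pandas(session, stream_name)
    del dataset
    dataset = bqs_stream_to_pandas(session, stream_name)
    del dataset
    dataset = bqs_stream_to_pandas(session, stream_name)
    del dataset
{code}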
 

Interestingly, the first chunk of memory is freed when the gRPC connection/session (I may be naming it incorrectly) is reset:
{code:java}
Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   471   2898.9 MiB   2898.9 MiB           1   @profile
   472                                         def bqs_stream_to_pandas(session, stream_name, row_limit=3660000):
   474   2898.9 MiB      0.0 MiB           1       client = bqs.BigQueryReadClient()
   475   1628.4 MiB  -1270.5 MiB           1       reader = client.read_rows(name=stream_name, offset=0)
   476   1628.4 MiB      0.0 MiB           1       rows = reader.rows(session)
...
{code}
If `message` is a `google.protobuf` message and a batch is created as below, will it be a zero-copy operation?
{code:python}
pyarrow.ipc.read_record_batch(
    pyarrow.py_buffer(message.arrow_record_batch.serialized_record_batch),
    self._schema,
)
{code}
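(One way to probe that empirically; a sketch reusing `message` and `self._schema` from the snippet above. `pa.py_buffer` wraps existing bytes via the buffer protocol without copying, so any growth in Arrow's allocator counter would come from the IPC read itself:)
{code:python}
import pyarrow as pa

serialized = message.arrow_record_batch.serialized_record_batch

before = pa.total_allocated_bytes()
# read_record_batch parses the IPC message; whether its buffers alias
# the input or get copied is what the counter below should reveal.
batch = pa.ipc.read_record_batch(pa.py_buffer(serialized), self._schema)
after = pa.total_allocated_bytes()
print(f"bytes allocated by read_record_batch: {after - before}")
{code}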
 

 

> [Python] Memory leak in pq.read_table and table.to_pandas
> ---------------------------------------------------------
>
>                 Key: ARROW-11007
>                 URL: https://issues.apache.org/jira/browse/ARROW-11007
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Michael Peleshenko
>            Priority: Major
>
> While upgrading our application to use pyarrow 2.0.0 instead of 0.12.1, we 
> observed a memory leak in the read_table and to_pandas methods. See below for 
> sample code to reproduce it. Memory does not seem to be returned after 
> deleting the table and df as it was in pyarrow 0.12.1.
> *Sample Code*
> {code:python}
> import io
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from memory_profiler import profile
> @profile
> def read_file(f):
>     table = pq.read_table(f)
>     df = table.to_pandas(strings_to_categorical=True)
>     del table
>     del df
> def main():
>     rows = 2000000
>     df = pd.DataFrame({
>         "string": ["test"] * rows,
>         "int": [5] * rows,
>         "float": [2.0] * rows,
>     })
>     table = pa.Table.from_pandas(df, preserve_index=False)
>     parquet_stream = io.BytesIO()
>     pq.write_table(table, parquet_stream)
>     for i in range(3):
>         parquet_stream.seek(0)
>         read_file(parquet_stream)
> if __name__ == '__main__':
>     main()
> {code}
> *Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    161.7 MiB    161.7 MiB           1   @profile
>     10                                         def read_file(f):
>     11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
>     12    258.2 MiB     46.1 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    258.2 MiB      0.0 MiB           1       del table
>     14    256.3 MiB     -1.9 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    256.3 MiB    256.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
>     12    322.2 MiB     43.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    322.2 MiB      0.0 MiB           1       del table
>     14    320.3 MiB     -1.9 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    320.3 MiB    320.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
>     12    361.7 MiB     34.8 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    361.7 MiB      0.0 MiB           1       del table
>     14    359.8 MiB     -1.9 MiB           1       del df
> {code}
> *Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    138.4 MiB    138.4 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     33.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.7 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.3 MiB    139.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.5 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.5 MiB    -47.7 MiB           1       del table
>     14    139.1 MiB    -32.4 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.1 MiB    139.1 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.7 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.8 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
> {code}


