[ 
https://issues.apache.org/jira/browse/ARROW-13187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369702#comment-17369702
 ] 

Weston Pace commented on ARROW-13187:
-------------------------------------

Great reproduction, thank you. I can reproduce this on 4.0.0 but not on 3.0.0. A few observations so far:

- pa.total_allocated_bytes is increasing, so this is not a dynamic allocator blowup issue (see the sketch after this list).
- "del table" prevents the process from running out of RAM (same as the table.slice workaround in the report).
- "gc.collect()" also prevents the process from running out of RAM.

 

Those workarounds shouldn't be necessary, however. When read_in_the_csv exits, the table is no longer needed: its refcount should decrease by 1 and it should be eligible for garbage collection. (Objects caught in a reference cycle are not freed by reference counting alone; they have to wait for the cyclic garbage collector, which would explain why an explicit gc.collect() helps.) Combined with the fact that this doesn't occur on 3.0.0 (both environments use Python 3.8, 3.8.6 vs. 3.8.8, but I doubt it's a Python change), I think this means a circular reference was introduced in the Arrow->Python code between 3.0.0 and 4.0.0.
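
For anyone hitting this before a fix lands, here is a hedged sketch of the workarounds above applied to the reporter's function (the names mirror the reproduction script below, and it assumes the same example.csv; the explicit del plus gc.collect() should only be needed until the suspected cycle is fixed):

{code:python}
import gc
import pyarrow as pa
import pyarrow.csv

def read_in_the_csv():
    table = pa.csv.read_csv("example.csv")
    print(table)
    # Workarounds observed above: drop the local reference explicitly and
    # run the cyclic garbage collector so the table's buffers are returned
    # to Arrow's memory pool right away.
    del table
    gc.collect()

for j in range(100000):
    read_in_the_csv()
{code}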

> Possibly memory not deallocated when reading in CSV
> ---------------------------------------------------
>
>                 Key: ARROW-13187
>                 URL: https://issues.apache.org/jira/browse/ARROW-13187
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 4.0.1
>            Reporter: Simon
>            Priority: Minor
>
> When reading a table from CSV with pyarrow 4.0.1, it appears that the 
> read-in table is not freed (or not freed fast enough). I'm unsure whether 
> this is caused by pyarrow itself or by the way pyarrow's memory allocation 
> interacts with Python's memory management. I encountered it when processing 
> many large CSVs sequentially.
> When I run the following piece of code, RAM usage increases rapidly until 
> the process runs out of memory.
> {code:python}
> import pyarrow as pa
> import pyarrow.csv
> # Generate some CSV file to read in
> print("Generating CSV")
> with open("example.csv", "w+") as f_out:
>     for i in range(0, 10000000):
>         f_out.write("123456789,abc def ghi jkl\n")
> def read_in_the_csv():
>     table = pa.csv.read_csv("example.csv")
>     print(table)  # Not strictly necessary to replicate bug, table can also 
> be an unused variable
>     # This will free up the memory, as a workaround:
>     # table = table.slice(0, 0)
> # Read in the CSV many times
> print("Reading in a CSV many times")
> for j in range(100000):
>     read_in_the_csv()
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
