[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357319#comment-16357319 ]
ASF GitHub Bot commented on ARROW-1973:
---------------------------------------

cpcloud commented on a change in pull request #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
URL: https://github.com/apache/arrow/pull/1578#discussion_r167017997

##########
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##########

@@ -1484,8 +1480,53 @@ class ArrowDeserializer {
       // Error occurred, trust that SetBaseObject set the error state
       return Status::OK();
     } else {
-      // PyArray_SetBaseObject steals our reference to base
-      Py_INCREF(base);
+      /*
+       * See ARROW-1973 for the original memory leak report.
+       *
+       * There are two scenarios: py_ref_ is nullptr or py_ref_ is not nullptr.
+       *
+       * 1. py_ref_ is nullptr (it **was not** passed in to ArrowDeserializer's
+       *    constructor)
+       *
+       *    In this case, the stolen reference must not be incremented since
+       *    nothing outside of the PyArrayObject* (the arr_ member) is holding
+       *    a reference to it. If we increment this, then we have a memory
+       *    leak.
+       *
+       *    Here's an example of how memory can be leaked when converting an
+       *    arrow Array of List<Float64> to a numpy array:
+       *
+       *    1. Create a 1D numpy array that is the flattened arrow array.
+       *
+       *       There's nothing outside of the serializer that owns this new
+       *       numpy array.
+       *
+       *    2. Make a capsule for the base array.
+       *
+       *       The reference count of base is 1.
+       *
+       *    3. Call PyArray_SetBaseObject(arr_, base).
+       *
+       *       The reference count is still 1, because the reference is
+       *       stolen.
+       *
+       *    4. Increment the reference count of base (unconditionally).
+       *
+       *       The reference count is now 2. This is okay if there's an object
+       *       holding another reference. The PyArrayObject that stole the
+       *       reference will eventually decrement the reference count, which
+       *       leaves us with a refcount of 1, with nothing owning that 1
+       *       reference. Memory leakage ensues.
+       *
+       * 2. py_ref_ is not nullptr (it **was** passed in to ArrowDeserializer's
+       *    constructor)
+       *
+       *    This case is simpler. We assume that the reference accounting is
+       *    correct coming in. We need to preserve that accounting knowing
+       *    that the PyArrayObject that stole the reference will eventually
+       *    decref it, thus we increment the reference count.
+       */
+      if (base == py_ref_) {
+        Py_INCREF(base);

Review comment:
   True, I can change. Most of the work here was understanding the accounting flow here :)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> [Python] Memory leak when converting Arrow tables with array columns to
> Pandas dataframes.
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-1973
>                 URL: https://issues.apache.org/jira/browse/ARROW-1973
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.8.0
>        Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>            Reporter: Alexey Strokach
>            Assignee: Phillip Cloud
>            Priority: Major
>             Labels: pull-request-available
>             Fix For: 0.9.0
>
> There appears to be a memory leak when using PyArrow to convert tables
> containing array columns to Pandas DataFrames.
> See the `test_memory_leak.py` example here:
> https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)