[ https://issues.apache.org/jira/browse/ARROW-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357319#comment-16357319 ]
ASF GitHub Bot commented on ARROW-1973:
---------------------------------------

cpcloud commented on a change in pull request #1578: ARROW-1973: [Python] Memory leak when converting Arrow tables with array columns to Pandas dataframes.
URL: https://github.com/apache/arrow/pull/1578#discussion_r167017997

##########
File path: cpp/src/arrow/python/arrow_to_pandas.cc
##########

@@ -1484,8 +1480,53 @@ class ArrowDeserializer {
       // Error occurred, trust that SetBaseObject set the error state
       return Status::OK();
     } else {
-      // PyArray_SetBaseObject steals our reference to base
-      Py_INCREF(base);
+      /*
+       * See ARROW-1973 for the original memory leak report.
+       *
+       * There are two scenarios: py_ref_ is nullptr or py_ref_ is not nullptr.
+       *
+       * 1. py_ref_ is nullptr (it **was not** passed in to ArrowDeserializer's
+       *    constructor)
+       *
+       *    In this case, the stolen reference must not be incremented since
+       *    nothing outside of the PyArrayObject* (the arr_ member) is holding
+       *    a reference to it. If we increment this, then we have a memory
+       *    leak.
+       *
+       *    Here's an example of how memory can be leaked when converting an
+       *    arrow Array of List<Float64> to a numpy array:
+       *
+       *    1. Create a 1D numpy array that is the flattened arrow array.
+       *
+       *       There's nothing outside of the serializer that owns this new
+       *       numpy array.
+       *
+       *    2. Make a capsule for the base array.
+       *
+       *       The reference count of base is 1.
+       *
+       *    3. Call PyArray_SetBaseObject(arr_, base).
+       *
+       *       The reference count is still 1, because the reference is
+       *       stolen.
+       *
+       *    4. Increment the reference count of base (unconditionally).
+       *
+       *       The reference count is now 2. This is okay if there's an object
+       *       holding another reference. The PyArrayObject that stole the
+       *       reference will eventually decrement the reference count, which
+       *       leaves us with a refcount of 1, with nothing owning that 1
+       *       reference. Memory leakage ensues.
+       *
+       * 2. py_ref_ is not nullptr (it **was** passed in to ArrowDeserializer's
+       *    constructor)
+       *
+       *    This case is simpler. We assume that the reference accounting is
+       *    correct coming in. We need to preserve that accounting knowing
+       *    that the PyArrayObject that stole the reference will eventually
+       *    decref it, thus we increment the reference count.
+       */
+      if (base == py_ref_) {
+        Py_INCREF(base);

Review comment:
   True, I can change. Most of the work here was understanding the accounting flow here :)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> [Python] Memory leak when converting Arrow tables with array columns to
> Pandas dataframes.
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-1973
>                 URL: https://issues.apache.org/jira/browse/ARROW-1973
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.8.0
>        Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>            Reporter: Alexey Strokach
>            Assignee: Phillip Cloud
>            Priority: Major
>             Labels: pull-request-available
>             Fix For: 0.9.0
>
> There appears to be a memory leak when using PyArrow to convert tables
> containing array columns to Pandas DataFrames.
> See the `test_memory_leak.py` example here:
> https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)