Apache Spark reassigned SPARK-6194:
-----------------------------------

    Assignee: Apache Spark  (was: Davies Liu)

> collect() in PySpark will cause memory leak in JVM
> --------------------------------------------------
>
>                 Key: SPARK-6194
>                 URL: https://issues.apache.org/jira/browse/SPARK-6194
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
>            Reporter: Davies Liu
>            Assignee: Apache Spark
>            Priority: Critical
>             Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> It can be reproduced by:
> {code}
> for i in range(40):
>     sc.parallelize(range(5000), 10).flatMap(lambda i: range(10000)).collect()
> {code}
> It fails after 2 or 3 jobs, but runs successfully if I add `gc.collect()` after each job.
> We could call _detach() on the JavaList returned by collect() in Java; I will send out a PR for this.
>
> Reported by Michael and commented on by Josh:
>
> On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen <joshro...@databricks.com> wrote:
> > Based on Py4J's Memory Model page
> > (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model):
> >
> >> Because Java objects on the Python side are involved in a circular
> >> reference (JavaObject and JavaMember reference each other), these objects
> >> are not immediately garbage collected once the last reference to the object
> >> is removed (but they are guaranteed to be eventually collected if the Python
> >> garbage collector runs before the Python program exits).
> >
> >> In doubt, users can always call the detach function on the Python gateway
> >> to explicitly delete a reference on the Java side. A call to gc.collect()
> >> also usually works.
> >
> > Maybe we should be manually calling detach() when the Python side has
> > finished consuming temporary objects from the JVM. Do you have a small
> > workload / configuration that reproduces the OOM which we can use to test a
> > fix? I don't think that I've seen this issue in the past, but this might be
> > because we mistook Java OOMs as being caused by collecting too much data
> > rather than by memory leaks.
> >
> > On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario <mnaza...@palantir.com> wrote:
> >>
> >> Hi Josh,
> >>
> >> I have a question about how PySpark does memory management in the Py4J
> >> bridge between the Java driver and the Python driver. I was wondering if
> >> there have been any memory problems in this system, because the Python
> >> garbage collector does not collect circular references immediately and Py4J
> >> has circular references in each object it receives from Java.
> >>
> >> When I dug through the PySpark code, I found that most RDD actions
> >> return by calling collect. In collect, you end up calling the Java RDD
> >> collect and getting an iterator from that. Could this be a possible cause
> >> of a Java driver OutOfMemoryError, because there are resources in Java
> >> which do not get freed up immediately?
> >>
> >> I have also seen that trying to take a lot of values from a dataset twice
> >> in a row can cause the Java driver to OOM (while doing it just once works).
> >> Are there some other memory considerations that are relevant in the driver?
> >>
> >> Thanks,
> >> Michael
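
A minimal sketch of the workaround discussed above (not the eventual patch): the reporter's reproduction loop with an explicit gc.collect() after each job. It assumes `sc` is an existing SparkContext, e.g. from a pyspark shell.

{code}
import gc

# Reproduction loop from the report. Each collect() returns a Py4J-wrapped
# java.util.List; the JavaObject/JavaMember circular reference keeps the
# JVM-side list alive until Python's cycle collector happens to run.
for i in range(40):
    sc.parallelize(range(5000), 10).flatMap(lambda i: range(10000)).collect()
    # Workaround: force the cycle collector so Py4J releases the Java-side
    # reference promptly instead of letting results pile up in the driver JVM.
    gc.collect()
{code}

The direction proposed in this ticket is instead to call _detach() on the JavaList once the Python side has finished consuming it, so user code would not need the explicit gc.collect().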