Apache Spark reassigned SPARK-6194:
-----------------------------------

    Assignee: Apache Spark  (was: Davies Liu)

> collect() in PySpark will cause memory leak in JVM
> --------------------------------------------------
>
>                 Key: SPARK-6194
>                 URL: https://issues.apache.org/jira/browse/SPARK-6194
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
>            Reporter: Davies Liu
>            Assignee: Apache Spark
>            Priority: Critical
>             Fix For: 1.2.2, 1.3.1, 1.4.0
>
>
> It can be reproduced by:
> {code}
> for i in range(40):
>     sc.parallelize(range(5000), 10).flatMap(lambda i: range(10000)).collect()
> {code}
> It fails after 2 or 3 jobs, but runs successfully if I add `gc.collect()` after each job.
> We could call _detach() on the JavaList returned by collect() in Java; I will send out a PR for this.
>
> Reported by Michael and commented on by Josh:
>
> On Thu, Mar 5, 2015 at 2:39 PM, Josh Rosen <joshro...@databricks.com> wrote:
> > Based on Py4J's Memory Model page
> > (http://py4j.sourceforge.net/advanced_topics.html#py4j-memory-model):
> >
> >> Because Java objects on the Python side are involved in a circular
> >> reference (JavaObject and JavaMember reference each other), these objects
> >> are not immediately garbage collected once the last reference to the object
> >> is removed (but they are guaranteed to be eventually collected if the Python
> >> garbage collector runs before the Python program exits).
> >
> >> In doubt, users can always call the detach function on the Python gateway
> >> to explicitly delete a reference on the Java side. A call to gc.collect()
> >> also usually works.
> >
> > Maybe we should be manually calling detach() when the Python side has
> > finished consuming temporary objects from the JVM. Do you have a small
> > workload / configuration that reproduces the OOM which we can use to test a
> > fix? I don't think that I've seen this issue in the past, but this might be
> > because we mistook Java OOMs as being caused by collecting too much data
> > rather than by memory leaks.
> >
> > On Thu, Mar 5, 2015 at 10:41 AM, Michael Nazario <mnaza...@palantir.com> wrote:
> >>
> >> Hi Josh,
> >>
> >> I have a question about how PySpark does memory management in the Py4J
> >> bridge between the Java driver and the Python driver. I was wondering if
> >> there have been any memory problems in this system, because the Python
> >> garbage collector does not collect circular references immediately and Py4J
> >> has circular references in each object it receives from Java.
> >>
> >> When I dug through the PySpark code, I found that most RDD actions
> >> return by calling collect. In collect, you end up calling the Java RDD
> >> collect and getting an iterator from that. Could this be a possible cause
> >> of a Java driver OutOfMemoryError, because there are resources in Java
> >> which do not get freed up immediately?
> >>
> >> I have also seen that trying to take a lot of values from a dataset twice
> >> in a row can cause the Java driver to OOM (while doing it just once works).
> >> Are there some other memory considerations that are relevant in the driver?
> >>
> >> Thanks,
> >> Michael
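
A minimal sketch of the workaround discussed above (not the eventual patch): the reporter's reproduction loop with an explicit gc.collect() after each job. It assumes `sc` is an existing SparkContext, e.g. from a pyspark shell.

{code}
import gc

# Reproduction loop from the report. Each collect() returns a Py4J-wrapped
# java.util.List; the JavaObject/JavaMember circular reference keeps the
# JVM-side list alive until Python's cycle collector happens to run.
for i in range(40):
    sc.parallelize(range(5000), 10).flatMap(lambda i: range(10000)).collect()
    # Workaround: force the cycle collector so Py4J releases the Java-side
    # reference promptly instead of letting results pile up in the driver JVM.
    gc.collect()
{code}

The direction proposed in this ticket is instead to call _detach() on the JavaList once the Python side has finished consuming it, so user code would not need the explicit gc.collect().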