[ https://issues.apache.org/jira/browse/SPARK-14652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242604#comment-15242604 ]
Liwei Lin commented on SPARK-14652:
-----------------------------------

Hi [~weideng], this problem is most likely caused by the absence of a {{DataFrame.unpersist()}} call -- please call {{readingsDataFrame.unpersist()}} at the end of {{foreachRDD}}. If you do need {{cache()}}, a better approach is to call {{cache()}} on the {{DStream}} rather than on the {{DataFrame}}, i.e. {{readings.cache()}} followed by {{readings.foreachRDD}}. The reason is that Spark Streaming calls {{unpersist()}} automatically for you after each batch when you use {{DStream.cache()}}, but does not do the same when you use {{DataFrame.cache()}}. (A minimal sketch of both approaches is included at the end of this message, after the quoted description.)

> pyspark streaming driver unable to cleanup metadata for cached RDDs leading to driver OOM
> ------------------------------------------------------------------------------------------
>
> Key: SPARK-14652
> URL: https://issues.apache.org/jira/browse/SPARK-14652
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Streaming
> Affects Versions: 1.6.1
> Environment: pyspark 1.6.1
> python 2.7.6
> Ubuntu 14.04.2 LTS
> Oracle JDK 1.8.0_77
> Reporter: Wei Deng
>
> ContextCleaner was introduced in SPARK-1103 and, according to its PR [here|https://github.com/apache/spark/pull/126]:
> {quote}
> RDD cleanup:
> {{ContextCleaner}} calls {{RDD.unpersist()}} to clean up persisted RDDs. Regarding metadata, the DAGScheduler automatically cleans up all metadata related to an RDD after all jobs have completed. Only {{SparkContext.persistentRDDs}} keeps strong references to persisted RDDs. The {{TimeStampedHashMap}} used for that has been replaced by {{TimeStampedWeakValueHashMap}}, which keeps only weak references to the RDDs, allowing them to be garbage collected.
> {quote}
> However, we have observed that this is not the case for a cached RDD in pyspark streaming code as of the current latest Spark 1.6.1 release. This is reflected in the ever-growing number of RDDs in the {{Storage}} tab of the Spark Streaming application's UI once a pyspark streaming job starts to run. We used the [writemetrics.py|https://github.com/weideng1/energyiot/blob/f74d3a8b5b01639e6ff53ac461b87bb8a7b1976f/analytics/writemetrics.py] code to reproduce the problem; every time, after running for 20+ hours, the driver's JVM starts to show signs of OOM, with the old generation filling up and the JVM stuck in full GC cycles without freeing any old-gen space, until the driver eventually crashes with OOM.
> We have collected a heap dump taken right before the OOM happened and can make it available for analysis if it's considered useful. However, it might be easier to simply monitor the growth of the number of RDDs in the {{Storage}} tab of the Spark application's UI to confirm this is happening. To illustrate the problem, we also set {{--conf spark.cleaner.periodicGC.interval=10s}} on the spark-submit command line of the pyspark code and enabled DEBUG-level logging in the driver's logback.xml, and confirmed that even when the cleaner is triggered as frequently as every 10 seconds, none of the cached RDDs are unpersisted automatically by ContextCleaner.
> Currently we have resorted to manually calling unpersist() to work around the problem. However, this goes against the spirit of SPARK-1103, i.e. automated garbage collection in the SparkContext.
> We also conducted a simple test with Scala code, also setting {{--conf spark.cleaner.periodicGC.interval=10s}}, and found that the Scala code was able to clean up the RDDs every 10 seconds as expected, so this appears to be a pyspark-specific issue. We suspect it has something to do with python not being able to pass those out-of-scope RDDs as weak references to the ContextCleaner.
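For reference, here is a minimal pyspark sketch of the two approaches suggested above (an explicit {{unpersist()}} at the end of {{foreachRDD}}, versus caching the {{DStream}} instead of the {{DataFrame}}). The {{readings}} and {{readingsDataFrame}} names follow the comment and writemetrics.py; the socket source, the {{Row}} schema, and the per-batch {{count()}} are stand-ins for the real ingestion and write logic, not the reporter's actual code.

{code:python}
# Sketch only -- the source, schema, and per-batch action below are placeholders.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="cache-unpersist-sketch")
ssc = StreamingContext(sc, 10)          # 10-second batches
sqlContext = SQLContext(sc)

readings = ssc.socketTextStream("localhost", 9999)   # placeholder input DStream

# Option 1: keep DataFrame.cache() inside foreachRDD, but unpersist() at the
# end of every batch so cached-RDD metadata does not pile up on the driver.
def process(time, rdd):
    if rdd.isEmpty():
        return
    readingsDataFrame = sqlContext.createDataFrame(
        rdd.map(lambda line: Row(value=line)))
    readingsDataFrame.cache()
    readingsDataFrame.count()           # stand-in for the real per-batch work
    readingsDataFrame.unpersist()       # manual cleanup (the current workaround)

readings.foreachRDD(process)

# Option 2 (suggested in the comment): cache the DStream instead of the
# DataFrame; Spark Streaming unpersists these RDDs automatically after each
# batch, so no explicit unpersist() is needed:
#
#   readings.cache()
#   readings.foreachRDD(lambda time, rdd: sqlContext.createDataFrame(
#       rdd.map(lambda line: Row(value=line))).count())

ssc.start()
ssc.awaitTermination()
{code}

Note that with option 2 only the input RDDs backing {{readings}} are persisted; a DataFrame created inside {{foreachRDD}} is not, so it would still need the explicit cache/unpersist pair from option 1 if it is reused within the batch.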