[ https://issues.apache.org/jira/browse/SPARK-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095063#comment-14095063 ]

Vlad Frolov commented on SPARK-1065:
------------------------------------

Heavy tasks now complete in 18 minutes each instead of 22 minutes, which is 
roughly a 20% speed-up. That is nice!
I don't see any problems on my YARN cluster. The Java processes use up to 
1.5GB of RAM each (which is my JVM limit) and the Python daemons use around 
650MB each. Those numbers still look a bit odd to me, but it is clear that 
the workers no longer leak.
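
For reference, the 1.5GB limit above is the kind of cap set through the 
executor memory property; a minimal sketch of that configuration follows 
(the app name, master, and value are assumptions about my setup, not 
recommendations):

    from pyspark import SparkConf, SparkContext

    # spark.executor.memory caps each executor's JVM heap; the Python
    # daemons are separate OS processes and are not bounded by it, which
    # is why they show up with their own ~650MB footprint.
    conf = (SparkConf()
            .setAppName("yarn-memory-check")       # illustrative name
            .setMaster("yarn-client")              # assumes a YARN cluster
            .set("spark.executor.memory", "1500m"))
    sc = SparkContext(conf=conf)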

Thanks a lot!

> PySpark runs out of memory with large broadcast variables
> ---------------------------------------------------------
>
>                 Key: SPARK-1065
>                 URL: https://issues.apache.org/jira/browse/SPARK-1065
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 0.7.3, 0.8.1, 0.9.0
>            Reporter: Josh Rosen
>            Assignee: Davies Liu
>
> PySpark's driver components may run out of memory when broadcasting large 
> variables (say 1 gigabyte).
> Because PySpark's broadcast is implemented on top of Java Spark's broadcast 
> by broadcasting a pickled Python object as a byte array, we may be retaining 
> multiple copies of the large object: a pickled copy in the JVM and a 
> deserialized copy in the Python driver.
> The problem could also be due to memory requirements during pickling.
> PySpark is also affected by broadcast variables not being garbage collected.  
> Adding an unpersist() method to broadcast variables may fix this: 
> https://github.com/apache/incubator-spark/pull/543.
> As a first step to fixing this, we should write a failing test to reproduce 
> the error.
> This was discovered by [~sandy]: ["trouble with broadcast variables on 
> pyspark"|http://apache-spark-user-list.1001560.n3.nabble.com/trouble-with-broadcast-variables-on-pyspark-tp1301.html].
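
For anyone who wants to try this, here is a minimal sketch along the lines 
of the failing test the description above asks for. It broadcasts a large 
pickled object (the pickled copy handed to the JVM plus the deserialized 
copy on the Python side is exactly the duplication described) and then 
frees it with the unpersist() call that the linked pull request proposes. 
The app name, sizes, and partition count are illustrative, not taken from 
the issue:

    from pyspark import SparkContext

    sc = SparkContext(appName="SPARK-1065-repro")

    # A large value; sc.broadcast() pickles it and ships the bytes to the
    # JVM, so the driver can briefly hold both the object and its pickle.
    big = bytearray(1024 * 1024 * 1024)  # ~1 GB, adjust to taste

    b = sc.broadcast(big)

    # Touch the broadcast value in every task so each worker fetches it.
    n = sc.parallelize(range(100), 10).map(lambda _: len(b.value)).count()

    # Release the broadcast's storage once it is no longer needed
    # (the unpersist() added by the pull request referenced above).
    b.unpersist()
    sc.stop()

If the duplication is present, driver memory should climb to a multiple of 
the object's size; with the fix, usage should stay close to a single copy.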



