[ https://issues.apache.org/jira/browse/SPARK-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093413#comment-14093413 ]
Vlad Frolov commented on SPARK-1065:
------------------------------------

I am facing the same issue in my project, where I use PySpark. Since the big
objects I have could easily fit into the nodes' memory, I am going to fall
back on the dummy workaround of saving my big objects to HDFS and loading
them on the Python nodes (a sketch of this is at the end of this message).
Does anybody have an idea how to fix the issue in a better way? I don't have
enough Scala or Java knowledge to fix this in Spark core. However, I feel
that broadcast variables could be reimplemented on the Python side, though
that seems like a dangerous idea because we don't want to maintain separate
implementations of the same feature in both languages. It would also save
memory: while broadcasts go through Scala, we hold one copy in the JVM, one
pickled copy in Python, and one constructed object copy in Python.

> PySpark runs out of memory with large broadcast variables
> ---------------------------------------------------------
>
>                 Key: SPARK-1065
>                 URL: https://issues.apache.org/jira/browse/SPARK-1065
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 0.7.3, 0.8.1, 0.9.0
>            Reporter: Josh Rosen
>
> PySpark's driver components may run out of memory when broadcasting large
> variables (say 1 gigabyte).
> Because PySpark's broadcast is implemented on top of Java Spark's broadcast
> by broadcasting a pickled Python object as a byte array, we may be retaining
> multiple copies of the large object: a pickled copy in the JVM and a
> deserialized copy in the Python driver.
> The problem could also be due to memory requirements during pickling.
> PySpark is also affected by broadcast variables not being garbage collected.
> Adding an unpersist() method to broadcast variables may fix this:
> https://github.com/apache/incubator-spark/pull/543.
> As a first step to fixing this, we should write a failing test to reproduce
> the error (a sketch of such a test is at the end of this message).
> This was discovered by [~sandy]: ["trouble with broadcast variables on
> pyspark"|http://apache-spark-user-list.1001560.n3.nabble.com/trouble-with-broadcast-variables-on-pyspark-tp1301.html].
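A minimal sketch of the workaround from the comment above, for anyone who
wants to try it: the large object is pickled to HDFS once from the driver,
and each task re-loads it through the hadoop CLI. The paths, the stand-in
object, and the assumption that the hadoop client is on every worker's PATH
are all illustrative, not anything Spark provides.

{code:python}
import pickle
import subprocess

from pyspark import SparkContext

sc = SparkContext(appName="broadcast-workaround")

big_object = set(range(10 ** 6))          # stand-in for the real large object
local_path = "/tmp/big_object.pkl"        # assumed scratch path on the driver
hdfs_path = "hdfs:///tmp/big_object.pkl"  # assumed HDFS destination

# Driver side: pickle to a local file, then copy it into HDFS
# (assumes the destination does not already exist).
with open(local_path, "wb") as f:
    pickle.dump(big_object, f, pickle.HIGHEST_PROTOCOL)
subprocess.check_call(["hadoop", "fs", "-put", local_path, hdfs_path])

def process_partition(records):
    # Worker side: tasks have no SparkContext, so read the file back
    # through the hadoop CLI and unpickle it once per partition.
    data = subprocess.check_output(["hadoop", "fs", "-cat", hdfs_path])
    big = pickle.loads(data)
    for r in records:
        yield (r, r in big)

result = sc.parallelize(range(100)).mapPartitions(process_partition).collect()
{code}

Loading in mapPartitions() rather than map() keeps the cost to one HDFS read
and one unpickle per partition, which is what makes this usable at all.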
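On the "failing test" point in the quoted description, a rough sketch of a
reproduction, assuming default driver and executor memory settings; the 1 GB
string and the app name are arbitrary choices, and nothing here is verified
to fail on every setup:

{code:python}
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-oom-repro")

# Roughly 1 GB held in the Python driver before pickling even starts.
big = "x" * (1024 * 1024 * 1024)

# Broadcasting adds a pickled copy in the JVM on top of the original,
# which is the duplication the description is about.
b = sc.broadcast(big)

# Force the workers to deserialize the broadcast value.
lengths = sc.parallelize(range(4), 4).map(lambda _: len(b.value)).collect()
print(lengths)
{code}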