[ https://issues.apache.org/jira/browse/SPARK-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259807#comment-14259807 ]
Fi edited comment on SPARK-4882 at 12/29/14 3:09 AM:
-----------------------------------------------------

No problem about updating the description; it's more concise and addresses the core problem. Thanks for the KryoSerializer and PySpark clarification — the docs didn't make it obvious whether it was beneficial or a no-op. I figured it might help and it didn't seem to break anything, so I went ahead and left it set in the Spark defaults, and thus ran into this problem. At least the workaround works and lets me leverage the broadcast feature, and in my use cases I saw no apparent degradation in performance.

> PySpark broadcast breaks when using KryoSerializer
> --------------------------------------------------
>
>                 Key: SPARK-4882
>                 URL: https://issues.apache.org/jira/browse/SPARK-4882
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.1.1, 1.2.0, 1.3.0
>            Reporter: Fi
>            Assignee: Josh Rosen
>
> When KryoSerializer is used, PySpark will throw a NullPointerException when
> trying to send broadcast variables to workers. This issue does not occur
> when the master is {{local}}, or when using the default JavaSerializer.
> *Reproduction*:
> Run
> {code}
> SPARK_LOCAL_IP=127.0.0.1 ./bin/pyspark --master local-cluster[2,2,512] --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
> {code}
> then run
> {code}
> b = sc.broadcast("hello")
> sc.parallelize([0]).flatMap(lambda x: b.value).collect()
> {code}
> This job fails because all tasks throw the following exception:
> {code}
> 14/12/28 14:26:08 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 8, localhost): java.lang.NullPointerException
> 	at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:589)
> 	at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:232)
> 	at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:228)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> 	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> 	at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:228)
> 	at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:203)
> 	at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:203)
> 	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1515)
> 	at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:202)
> {code}
> KryoSerializer may be enabled in the {{spark-defaults.conf}} file, so users
> may hit this error and be confused.
> *Workaround*:
> Override the {{spark.serializer}} setting to use the default Java serializer.
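For reference, one way to apply the workaround above is a config sketch: set {{spark.serializer}} back to the default Java serializer, either in {{spark-defaults.conf}} or per invocation via {{--conf}}. (The class name {{org.apache.spark.serializer.JavaSerializer}} is Spark's default serializer; the {{--master}} value below just mirrors the reproduction command.)

{code}
# spark-defaults.conf: pin the default Java serializer so PySpark broadcasts work
spark.serializer org.apache.spark.serializer.JavaSerializer
{code}

or, on the command line:

{code}
SPARK_LOCAL_IP=127.0.0.1 ./bin/pyspark --master local-cluster[2,2,512] --conf spark.serializer=org.apache.spark.serializer.JavaSerializer
{code}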
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org