[ https://issues.apache.org/jira/browse/SPARK-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259807#comment-14259807 ]

Fi edited comment on SPARK-4882 at 12/29/14 3:09 AM:
-----------------------------------------------------

No problem about updating the description; it's more concise and addresses the 
core problem.

Thanks for the clarification about KryoSerializer and PySpark; the docs didn't 
make it obvious whether the setting was beneficial or a no-op.
I figured it might help and it didn't seem to break anything, so I went ahead 
and left it set in the Spark defaults, which is how I ran into this problem.
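For context, the setting in question would be a line along these lines in 
{{conf/spark-defaults.conf}} (the exact path follows the stock Spark layout; 
the value is the serializer class named in the issue):

{code}
# conf/spark-defaults.conf -- enabling Kryo globally, which triggers the
# PySpark broadcast NullPointerException described in this issue
spark.serializer    org.apache.spark.serializer.KryoSerializer
{code}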

At least the workaround works and lets me leverage the broadcast feature, and 
in my use cases I saw no apparent degradation in performance.
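For anyone else hitting this, the workaround amounts to overriding the 
serializer back to the Java default for the PySpark session. A sketch, reusing 
the reproduction command from the issue description with only the serializer 
class swapped (a {{--conf}} flag on the command line takes precedence over 
{{spark-defaults.conf}}):

{code}
SPARK_LOCAL_IP=127.0.0.1 ./bin/pyspark --master local-cluster[2,2,512] \
  --conf spark.serializer=org.apache.spark.serializer.JavaSerializer
{code}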







> PySpark broadcast breaks when using KryoSerializer
> --------------------------------------------------
>
>                 Key: SPARK-4882
>                 URL: https://issues.apache.org/jira/browse/SPARK-4882
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.1.1, 1.2.0, 1.3.0
>            Reporter: Fi
>            Assignee: Josh Rosen
>
> When KryoSerializer is used, PySpark will throw NullPointerException when 
> trying to send broadcast variables to workers.  This issue does not occur 
> when the master is {{local}}, or when using the default JavaSerializer.
> *Reproduction*:
> Run
> {code}
> SPARK_LOCAL_IP=127.0.0.1 ./bin/pyspark --master local-cluster[2,2,512] --conf 
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> {code}
> then run
> {code}
> b = sc.broadcast("hello")
> sc.parallelize([0]).flatMap(lambda x: b.value).collect()
> {code}
> This job fails because all tasks throw the following exception:
> {code}
> 14/12/28 14:26:08 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 8, 
> localhost): java.lang.NullPointerException
>       at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:589)
>       at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:232)
>       at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:228)
>       at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>       at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>       at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>       at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>       at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:228)
>       at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:203)
>       at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:203)
>       at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1515)
>       at 
> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:202)
> {code}
> KryoSerializer may be enabled in the {{spark-defaults.conf}} file, so users 
> may hit this error and be confused.
> *Workaround*:
> Override the {{spark.serializer}} setting to use the default Java serializer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
