I am working on some code that uses mapPartitions. It works well, except
when the function passed to mapPartitions references a variable from the
enclosing scope (for example, a variable declared immediately before the
mapPartitions call). When that happens, I get a "Task not serializable"
error. The variable I want to reference is a broadcast variable, which I
expected to be ready to use inside that closure.
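
To make the first failure concrete, here is a minimal sketch in plain Java
serialization rather than Spark. It is not my actual job, and I am only
assuming the same mechanism is behind the Spark error. The names
ClosureCaptureSketch, SerFn, Job, and canSerialize are made up for
illustration. It shows a function that reads a field of a non-serializable
enclosing class (and so drags the whole instance along) next to one that
copies the value into a local variable first.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

public class ClosureCaptureSketch {
    // Hypothetical stand-in for a function type that is shipped to workers.
    public interface SerFn<A, B> extends Function<A, B>, Serializable {}

    // Hypothetical driver-side class; note that it is NOT Serializable.
    public static class Job {
        int offset = 10;

        // Reads the field inside the lambda, so the lambda captures `this`.
        public SerFn<Integer, Integer> capturesThis() {
            return x -> x + offset;
        }

        // Copies the field to a local first, so only the int is captured.
        public SerFn<Integer, Integer> capturesLocal() {
            int local = offset;
            return x -> x + local;
        }
    }

    // Returns true if plain Java serialization accepts the object.
    public static boolean canSerialize(Object o) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Job job = new Job();
        System.out.println("captures this:  " + canSerialize(job.capturesThis()));
        System.out.println("captures local: " + canSerialize(job.capturesLocal()));
    }
}
```

If the same thing is happening in my Spark code, the variable itself may be
fine and the problem may be whatever else the closure pulls in with it.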

After seeing that error, I tried another approach: storing the broadcast
variable in an object (a singleton). That serialized fine, but when I ran it
on a cluster, every reference to it threw a NullPointerException. My guess
is that the workers receive the serialized function, but Spark does not
serialize the object that the function refers to, including the things that
object references. The field is therefore never set on the workers, even
though the value it should hold was broadcast, and the copied reference is
invalid.
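
Here is a small sketch of that guess, again in plain Java rather than Spark,
with made-up names (StaticHolderSketch, SerSupplier, Holder,
valueSeenByWorker). A static field stands in for the singleton object, and
resetting it to null between serializing and deserializing stands in for a
fresh worker JVM. This is only an assumption about what happens on the
cluster, not something I have verified there.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Supplier;

public class StaticHolderSketch {
    // Hypothetical stand-in for a function type that is shipped to workers.
    public interface SerSupplier<T> extends Supplier<T>, Serializable {}

    // Hypothetical singleton-style holder. Static state lives per JVM and is
    // never written to the serialization stream.
    public static class Holder {
        public static String value;
    }

    // Serializes a function on the "driver", simulates a fresh "worker",
    // then deserializes the function and returns what it reads.
    public static String valueSeenByWorker() throws Exception {
        Holder.value = "set on the driver";
        SerSupplier<String> fn = () -> Holder.value;

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(fn);
        }

        // Simulate a worker JVM in which nobody ever assigned Holder.value.
        Holder.value = null;

        try (ObjectInputStream in = new ObjectInputStream(
                 new ByteArrayInputStream(buffer.toByteArray()))) {
            @SuppressWarnings("unchecked")
            SerSupplier<String> onWorker = (SerSupplier<String>) in.readObject();
            return onWorker.get();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("value seen after deserialization: " + valueSeenByWorker());
    }
}
```

The function itself serializes without complaint because it captures
nothing. It only looks up the static field when it runs, which would explain
why I see no serialization error but get a null on the workers.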

What would be a good way to solve this? Is there a way to reference a
broadcast variable by name rather than through a variable?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Variables-outside-of-mapPartitions-scope-tp5517.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.