I am working on some code that uses mapPartitions. It's working great, except when the function passed to mapPartitions references a variable from outside its own scope (for example, a variable declared immediately before the mapPartitions call). When this happens, I get a "task not serializable" error. What I wanted was to reference a variable that had been broadcast and have it ready to use within that closure.
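
To make the setup concrete, here is a minimal sketch of the pattern in question. It is not the actual code: it assumes Scala and a local master, and the names `BroadcastSketch`, `run`, `weights`, and the sample data are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  // Hypothetical job: maps each key to a value from a broadcast lookup table.
  def run(sc: SparkContext): Array[Int] = {
    // The broadcast handle lives in a local val, so the closure below
    // captures only this handle and not an enclosing object.
    val weights = sc.broadcast(Map("a" -> 1, "b" -> 10))

    sc.parallelize(Seq("a", "b", "a", "b"), 2)
      .mapPartitions { iter =>
        // .value is read on the executor; only the handle is serialized
        // together with the closure.
        val w = weights.value
        iter.map(key => w.getOrElse(key, 0))
      }
      .collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, purely for the sketch.
    val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(run(sc).mkString(","))
    finally sc.stop()
  }
}
```

In this form the closure captures only the local broadcast handle. If the handle were instead a field of an enclosing class that is not serializable, the closure would capture that whole instance, which is one common cause of a task-not-serializable error.
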
Seeing that, I tried another solution: storing the broadcast variable in an object (a singleton). It serialized fine, but when I ran it on a cluster, any reference to it threw a null pointer exception. My presumption is that the workers were not getting their objects updated for some reason, despite my setting it as a broadcast variable. My guess is that the workers get the serialized function, but Spark doesn't know to serialize the object along with the things it references, so the copied reference becomes invalid.

What would be a good way to solve my problem? Is there a way to reference a broadcast variable by name rather than through a variable?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Variables-outside-of-mapPartitions-scope-tp5517.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.