Variables outside of mapPartitions scope
I am working on some code that uses mapPartitions. It works great, except when the function passed to mapPartitions uses a variable that references something outside its scope (for example, a variable declared immediately before the mapPartitions call). When this happens, I get a "Task not serializable" error. The variable I wanted to reference had been broadcast and was ready to use within that closure.

Seeing that, I tried another solution: storing the broadcast variable in an object (a singleton). It serialized fine, but when I ran it on a cluster, any reference to it threw a NullPointerException. My presumption is that the workers were not getting their objects updated for some reason, despite my setting it as a broadcast variable. My guess is that the workers get the serialized function, but Spark doesn't know to serialize the object, including the things it references, so the copied reference becomes invalid.

What would be a good way to solve my problem? Is there a way to reference a broadcast variable by name rather than through a variable?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Variables-outside-of-mapPartitions-scope-tp5517.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Variables outside of mapPartitions scope
Scala's for loop is not just looping; it is not a native loop at the bytecode level. It creates a couple of objects at runtime and performs a truckload of method calls on them. As a result, if you refer to variables outside the for loop, the whole for-loop object and any variable inside the loop have to be serializable. Since the for loop itself is serializable in Scala, I guess you have something non-serializable inside the for loop.

The while loop in Scala is native, so you won't have this issue if you use a while loop.

Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Fri, May 9, 2014 at 1:13 PM, pedro wrote:
> Right now I am not using any class variables (references to this). All my
> variables are created within the scope of the method I am running.
>
> I did more debugging and found this strange behavior.
>
>     variables here
>     for loop
>         mapPartitions call
>             use variables here
>         end mapPartitions
>     endfor
>
> This will result in a serializable bug, but this won't
>
>     variables here
>     for loop
>         create new references to variables here
>         mapPartitions call
>             use new reference variables here
>         end mapPartitions
>     endfor
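To make the point about for loops concrete, here is a minimal sketch, using no Spark at all, of how a Scala for loop over a range is rewritten by the compiler into a foreach call that takes a function object, while a while loop stays a plain loop. The object and method names are hypothetical and exist only for illustration.

```scala
object ForLoopDesugar {
  // A for loop over a Range: the compiler rewrites this into
  // (0 until n).foreach(i => ...), so the body becomes a closure.
  def sumWithFor(n: Int): Int = {
    var total = 0
    for (i <- 0 until n) {
      total += i
    }
    total
  }

  // The same computation written the way the compiler sees it.
  def sumWithForeach(n: Int): Int = {
    var total = 0
    (0 until n).foreach { i => total += i }
    total
  }

  // A while loop has no closure: nothing is captured around the body.
  def sumWithWhile(n: Int): Int = {
    var total = 0
    var i = 0
    while (i < n) {
      total += i
      i += 1
    }
    total
  }
}
```

All three methods compute the same result; the difference is only that the first two wrap the loop body in a function object, which is where extra captured references can come from.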
Re: Variables outside of mapPartitions scope
In general, you can find out exactly what's not serializable by adding -Dsun.io.serialization.extendedDebugInfo=true to SPARK_JAVA_OPTS.

Since a "this" reference to the enclosing class is often what's causing the problem, a general workaround is to move the mapPartitions call to a static method (in Scala, a method on the companion object), where there is no "this" reference. This transforms this:

    class A {
      def f() = rdd.mapPartitions(iter => ...)
    }

into this:

    class A {
      def f() = A.helper(rdd)
    }

    object A {
      def helper(rdd: RDD[...]) = rdd.mapPartitions(iter => ...)
    }
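The effect of that workaround can be checked without a cluster. The sketch below is an illustration under stated assumptions, not Spark code: it uses plain Java serialization (the same mechanism behind the "Task not serializable" error) to show that a closure built inside an instance method drags the enclosing instance along, while one built in the companion object does not. Scaler, scaleBy, and canSerialize are hypothetical names introduced for this example.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerializationCheck {
  // Returns true if Java serialization accepts the value, false if it
  // hits something that is not Serializable.
  def canSerialize(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }
}

// Hypothetical class, deliberately NOT Serializable.
class Scaler(val factor: Int) {
  // Reads this.factor inside the closure, so the closure captures `this`.
  def viaInstance: Int => Int = x => x * factor

  // Delegates to the companion object; only the Int value is passed on.
  def viaHelper: Int => Int = Scaler.scaleBy(factor)
}

object Scaler {
  // No enclosing instance here, so the closure captures only `factor`.
  def scaleBy(factor: Int): Int => Int = x => x * factor
}
```

Calling SerializationCheck.canSerialize on the result of viaInstance should report a failure, because the non-serializable Scaler instance is part of the closure, whereas the result of viaHelper carries only an Int and serializes cleanly.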
Re: Variables outside of mapPartitions scope
Right now I am not using any class variables (references to this). All my variables are created within the scope of the method I am running.

I did more debugging and found this strange behavior:

    variables here
    for loop
        mapPartitions call
            use variables here
        end mapPartitions
    endfor

This results in a serialization error, but this does not:

    variables here
    for loop
        create new references to variables here
        mapPartitions call
            use new reference variables here
        end mapPartitions
    endfor
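The thread does not establish why creating new references helps in this particular case. One situation where the same pattern is known to help is when the original variable is reached through an enclosing object: copying it into a local val first means the closure captures only the copied value. The sketch below shows that case with plain Java serialization and no Spark; Settings, directFilter, copiedFilter, and canSerialize are hypothetical names, and this is an assumption about a possible cause rather than a diagnosis of the code discussed above.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object LocalCopyDemo {
  // Returns true if Java serialization accepts the value.
  def canSerialize(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(value)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }

  // Hypothetical holder, deliberately NOT Serializable.
  class Settings(val threshold: Int) {
    // Uses the member directly, so the closure captures the Settings instance.
    def directFilter: Int => Boolean = x => x > threshold

    // Copies the member into a local val first; the closure captures
    // only that Int, not the enclosing instance.
    def copiedFilter: Int => Boolean = {
      val localThreshold = threshold
      x => x > localThreshold
    }
  }
}
```

Both filters behave identically when applied; only copiedFilter can be serialized, since nothing non-serializable is reachable from it.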