Uh, for some reason I no longer seem to reply to the list automatically.
Here is my message to Tom again.

---------- Forwarded message ----------

Tom,

On Wed, Aug 13, 2014 at 5:35 AM, Tom Vacek <minnesota...@gmail.com> wrote:

> This is a back-to-basics question. How do we know when Spark will clone
> an object and distribute it with task closures, versus synchronize access
> to it?
>
> For example, the old rookie mistake of random number generation:
>
> import scala.util.Random
> val randRDD = sc.parallelize(0 until 1000).map(ii => Random.nextGaussian)
>
> One can check to see that each partition contains a different set of
> random numbers, so the RNG obviously was not cloned, but access was
> synchronized.
>

In this case, Random is a singleton object; Random.nextGaussian is like a
static method of a Java class. Access is not synchronized (unless I
misunderstand what you mean by "synchronized"); instead, each Spark worker
uses a JVM-local instance of the Random object. You don't actually close
over the Random object here at all. In fact, this is one way to get
node-local state (e.g., for DB connection pooling).
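To illustrate, here is a minimal sketch (plain Scala, no Spark; the `NodeLocal` object and its `pool` field are hypothetical names). A closure that refers to a singleton object compiles to a static-like reference, so serializing the closure does not serialize the singleton's state; on a cluster, each executor JVM would lazily initialize its own copy.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import scala.collection.mutable

// Hypothetical per-JVM state, e.g. a connection pool. On a Spark cluster
// each executor JVM would initialize its own copy lazily on first use;
// the state itself never travels with the task closure.
object NodeLocal {
  lazy val pool: mutable.Map[String, Int] = mutable.Map("connections" -> 4)
}

// This closure refers to NodeLocal by name (a static-like reference),
// so it captures nothing: serializing it does not serialize the pool.
val f: Int => Int = ii => ii + NodeLocal.pool("connections")

val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(f) // succeeds: Scala lambdas are serializable
oos.close()

println(f(1)) // 5: reads the JVM-local pool, not anything shipped with f
```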


> However:
>
> val myMap = collection.mutable.Map.empty[Int,Int]
> sc.parallelize(0 until 100).mapPartitions(it => {it.foreach(ii => myMap(ii) = 
> ii); Array(myMap).iterator}).collect
>
>
> This shows that each partition got a copy of the empty map and filled it
> in with its portion of the rdd.
>

In this case, myMap is an instance of the mutable Map class, so it is
serialized and shipped around with the closure. In fact, if you did `val
Random = new scala.util.Random()` in your code above, then that instance
would also be serialized and treated just like myMap. (NB: actually, it is
not. Spark hangs for me when I do this and doesn't return anything...)
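The capture-and-copy behavior can be demonstrated without Spark using plain Java serialization (a sketch; `serializeRoundTrip` and `makeClosure` are helper names I made up). A closure over a local mutable Map carries the Map with it, so a deserialized copy of the closure has its own independent state, just as each Spark task gets its own copy of myMap.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable

// Serialize an object to bytes and read it back, yielding a deep copy --
// roughly what Spark does when it ships a task closure to an executor.
def serializeRoundTrip[T](x: T): T = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(x)
  oos.close()
  new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
    .readObject().asInstanceOf[T]
}

def makeClosure(): (Int => Int, mutable.Map[Int, Int]) = {
  val myMap = mutable.Map(1 -> 1)       // local val, so the lambda captures it
  (ii => myMap.getOrElse(ii, 0), myMap)
}

val (f, original) = makeClosure()
val g = serializeRoundTrip(f) // g owns a deserialized copy of the Map

original(1) = 99 // mutate the original Map after the round trip
println(f(1))    // 99: f still shares the original Map
println(g(1))    // 1: g's copy froze the state at serialization time
```

This is also why mutating myMap inside a task never changes the driver's copy: each task mutates its own deserialized Map.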

Tobias
