an example for Sebastian's response is:

rdd.mapPartitions { partitionIter =>
  val a = new Array[Int](100)
  partitionIter.map { e =>
    .. Some calculation that reuses `a` ...
  }
}

If the array were outside the map(), then that closure, the array and the
outer block (corresponding to the array reference variable's scope) will be
serialized and shipped once for each RDD task that gets executed. Since
there an RDD task is created for materializing each RDD partition, there
would still be an array used for each partition. Also, the serialized outer
block could be an entire class, which increases the size of each task,
which increases the scheduling latency...


On Tue, Nov 19, 2013 at 1:01 AM, Wenlei Xie <wenlei....@gmail.com> wrote:

> Hi,
>
> I am trying to do some tasks with the following style map function:
>
> rdd.map { e =>
>     val a = new Array[Int](100)
>     ...Some calculation...
> }
>
> But here the array a is really just used as a temporary buffer and can be
> reused. I am wondering if I can avoid constructing it everytime? (As it
> might incur some overhead for JVM?) Would use an array outside the closure
> work?
>
> Thank you!
>
> Best,
> Wenlei
>
>

Reply via email to