I'm experiencing some strange behavior with closure serialization that is
totally mind-boggling to me.  It appears that two arrays of equal size take
up vastly different amounts of space inside closures if they're generated in
different ways.

The basic flow of my app is to run a bunch of tiny regressions using
Commons Math's OLSMultipleLinearRegression and then reference a 2D array of
the results from a transformation.  I was running into OOMEs and
NotSerializableExceptions, so I tried to get closer to the root issue by
calling the closure serializer directly.  I build the array of regression
coefficients like this:
  scala> val arr = models.map(_.estimateRegressionParameters()).toArray
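
For context, models comes from fitting one tiny OLS regression per series,
roughly like this (a sketch: the synthetic data stands in for my real
loading code, and the import assumes Commons Math 3):
  scala> import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression
  scala> // synthetic stand-in: 1867 series, each with 100 observations of a
  scala> // response y and 5 regressors x
  scala> val data = Seq.fill(1867)((Array.fill(100)(math.random), Array.fill(100, 5)(math.random)))
  scala> val models = data.map { case (y, x) =>
  |   val m = new OLSMultipleLinearRegression()
  |   m.newSampleData(y, x)  // fit OLS for this series
  |   m
  | }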

The resulting array is 1867 x 5.  Serialized on its own, it's about 80K
bytes, which seems about right:
  scala> SparkEnv.get.closureSerializer.newInstance().serialize(arr)
  res17: java.nio.ByteBuffer = java.nio.HeapByteBuffer[pos=0 lim=80027
cap=80027]

If I reference it from a simple function:
  scala> def func(x: Long) = arr.length
  scala> SparkEnv.get.closureSerializer.newInstance().serialize(func _)
I get a NotSerializableException.
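
To see what's getting dragged in, I can dump the fields the compiler
generated for the closure (a sketch; the field names are compiler-generated,
vary by Scala version, and in the REPL typically include a reference back to
the enclosing interpreter line object):
  scala> val f = func _
  scala> f.getClass.getDeclaredFields.foreach(fld =>
  |   println(fld.getName + ": " + fld.getType))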

If I take pains to create the array using a loop:
  scala> val arr = Array.ofDim[Double](1867, 5)
  scala> for (s <- 0 until models.length) {
  |   arr(s) = models(s).estimateRegressionParameters()
  | }
Now serialization succeeds, but the serialized closure for the same function
is a whopping 400MB.
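
(To compare these sizes I've been using a tiny helper; serializedSize is
just a name I made up:)
  scala> def serializedSize(obj: AnyRef): Int =
  |   SparkEnv.get.closureSerializer.newInstance().serialize(obj).limit()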

If the function instead references an array of the same dimensions that was
created in yet another way, the serialized closure is only about 90K, which
seems about right.
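
Concretely, that comparison looked something like this, with Array.fill
standing in for the other construction path (a sketch):
  scala> val arr2 = Array.fill(1867, 5)(0.0)
  scala> val func2 = (x: Long) => arr2.length
  scala> SparkEnv.get.closureSerializer.newInstance().serialize(func2)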

Naively, it seems like the history of how the array was created is somehow
affecting how it gets captured inside a closure.
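
The two arrays look identical at runtime, in both type and shape (a sketch,
using the hypothetical arr2 from above), so whatever differs must be in what
the closure captures rather than in the array value itself:
  scala> arr.getClass == arr2.getClass  // both are plain Array[Array[Double]]
  scala> (arr.length, arr(0).length) == (arr2.length, arr2(0).length)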

Is this expected behavior?  Can anybody explain what's going on?

Any insight very much appreciated,
Sandy
