I'm experiencing some strange behavior with closure serialization that is totally mind-boggling to me. It appears that two arrays of equal size take up vastly different amounts of space inside closures if they're generated in different ways.
The basic flow of my app is to run a bunch of tiny regressions using Commons Math's OLSMultipleLinearRegression and then reference a 2D array of the results from a transformation. I was running into OutOfMemoryErrors and NotSerializableExceptions, and tried to get closer to the root issue by calling the closure serializer directly.

  scala> val arr = models.map(_.estimateRegressionParameters()).toArray

The resulting array is 1867 x 5. Serialized, it is about 80 KB, which seems about right:

  scala> SparkEnv.get.closureSerializer.newInstance().serialize(arr)
  res17: java.nio.ByteBuffer = java.nio.HeapByteBuffer[pos=0 lim=80027 cap=80027]

If I reference it from a simple function:

  scala> val func = (x: Long) => arr.length
  scala> SparkEnv.get.closureSerializer.newInstance().serialize(func)

I get a NotSerializableException.

If I take pains to create the array using a loop:

  scala> val arr = Array.ofDim[Double](1867, 5)
  scala> for (s <- 0 until models.length) {
       |   arr(s) = models(s).estimateRegressionParameters()
       | }

Serialization works, but the serialized closure for the function is a whopping 400 MB.

If I pass in an array of the same length that was created in a different way, the size of the serialized closure is only about 90 KB, which seems about right.

Naively, it seems like the history of how the array was created is somehow having an effect on what happens to it inside a closure.

Is this expected behavior? Can anybody explain what's going on?

Any insight is much appreciated,
Sandy