Re: closure serialization behavior driving me crazy

2014-11-11 Thread Sandy Ryza
I tried turning on the extended debug info.  The Scala output is a little
opaque (lots of entries like - field (class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name:
$iw, type: class $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC)), but it seems
like, as expected, the full array of OLSMultipleLinearRegression
objects is somehow getting pulled in.

I'm not sure I understand your comment about Array.ofDim being large.  When
serializing the array alone, it only takes up about 80K, which is close to
1867*5*sizeof(double).  The 400MB comes when referencing the array from a
function, which pulls in all the extra data.

Copying the global variable into a local one seems to work.  Much
appreciated, Matei.
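
A minimal sketch of that fix, assuming the weights are looked up from inside a
map over some RDD of row indices; the name indices and the map body are
illustrative, not the actual code:

  val arr = models.map(_.estimateRegressionParameters()).toArray

  val predictions = {
    val localArr = arr                  // copy the REPL "global" into a local val
    indices.map(i => localArr(i).sum)   // this closure captures only localArr,
  }                                     // not the $iwC line object that also holds models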


On Mon, Nov 10, 2014 at 9:26 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hey Sandy,

 Try using the -Dsun.io.serialization.extendedDebugInfo=true flag on the
 JVM to print the contents of the objects. In addition, something else that
 helps is to do the following:

 {
   val _arr = arr
   models.map(... _arr ...)
 }

 Basically, copy the global variable into a local one. Then the field
 access from outside (from the interpreter-generated object that contains
 the line initializing arr) is no longer required, and the closure no longer
 has a reference to that.

 I'm really confused as to why Array.ofDim would be so large by the way,
 but are you sure you haven't flipped around the dimensions (e.g. it should
 be 5 x 1800)? A 5-double array will consume more than 5*8 bytes (probably
 something like 60 at least), and an array of those will still have a
 pointer to each one, so I'd expect that many of them to be more than 80 KB
 (which is very close to 1867*5*8).

 Matei

  On Nov 10, 2014, at 1:01 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
 
  I'm experiencing some strange behavior with closure serialization that
 is totally mind-boggling to me.  It appears that two arrays of equal size
 take up vastly different amounts of space inside closures if they're
 generated in different ways.
 
  The basic flow of my app is to run a bunch of tiny regressions using
 Commons Math's OLSMultipleLinearRegression and then reference a 2D array of
 the results from a transformation.  I was running into OOMEs and
 NotSerializableExceptions and tried to get closer to the root issue by
 calling the closure serializer directly.
   scala> val arr = models.map(_.estimateRegressionParameters()).toArray
 
  The result array is 1867 x 5.  Serialized, it comes to about 80K bytes, which
 seems about right:
   scala> SparkEnv.get.closureSerializer.newInstance().serialize(arr)
   res17: java.nio.ByteBuffer = java.nio.HeapByteBuffer[pos=0 lim=80027
 cap=80027]
 
  If I reference it from a simple function:
   scala> def func(x: Long) = arr.length
   scala> SparkEnv.get.closureSerializer.newInstance().serialize(func)
  I get a NotSerializableException.
 
  If I take pains to create the array using a loop:
   scala> val arr = Array.ofDim[Double](1867, 5)
   scala> for (s <- 0 until models.length) {
        | arr(s) = models(s).estimateRegressionParameters()
        | }
  Serialization works, but the serialized closure for the function is a
 whopping 400MB.
 
  If I pass in an array of the same length that was created in a different
 way, the size of the serialized closure is only about 90K, which seems
 about right.
 
  Naively, it seems like somehow the history of how the array was created
 is having an effect on what happens to it inside a closure.
 
  Is this expected behavior?  Can anybody explain what's going on?
 
  any insight very appreciated,
  Sandy




closure serialization behavior driving me crazy

2014-11-10 Thread Sandy Ryza
I'm experiencing some strange behavior with closure serialization that is
totally mind-boggling to me.  It appears that two arrays of equal size take
up vastly different amounts of space inside closures if they're generated in
different ways.

The basic flow of my app is to run a bunch of tiny regressions using
Commons Math's OLSMultipleLinearRegression and then reference a 2D array of
the results from a transformation.  I was running into OOMEs and
NotSerializableExceptions and tried to get closer to the root issue by
calling the closure serializer directly.
  scala> val arr = models.map(_.estimateRegressionParameters()).toArray

The result array is 1867 x 5.  Serialized, it comes to about 80K bytes, which
seems about right:
  scala> SparkEnv.get.closureSerializer.newInstance().serialize(arr)
  res17: java.nio.ByteBuffer = java.nio.HeapByteBuffer[pos=0 lim=80027
cap=80027]
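
(Sanity check on that number: 1867 * 5 * 8 bytes = 74,680 bytes of raw doubles,
so an ~80K serialized form is dominated by the data itself, as expected.)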

If I reference it from a simple function:
  scala> def func(x: Long) = arr.length
  scala> SparkEnv.get.closureSerializer.newInstance().serialize(func)
I get a NotSerializableException.
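
The same capture behavior can be reproduced outside Spark.  A rough sketch,
where Wrapper stands in for the REPL-generated $iwC line object and NotSer for
whatever non-serializable object ends up referenced from it (all names here
are made up):

  import java.io.{ByteArrayOutputStream, ObjectOutputStream}

  class NotSer                                  // plays the role of whatever isn't serializable
  class Wrapper extends Serializable {          // plays the role of the REPL line object
    val models = Array.fill(1867)(new NotSer)
    val arr    = Array.ofDim[Double](1867, 5)
    val func   = (x: Long) => arr.length        // reading the field arr captures `this`
  }

  val out = new ObjectOutputStream(new ByteArrayOutputStream())
  out.writeObject(new Wrapper().func)           // NotSerializableException: serializing the
                                                // closure drags in Wrapper, hence models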

If I take pains to create the array using a loop:
  scala> val arr = Array.ofDim[Double](1867, 5)
  scala> for (s <- 0 until models.length) {
       | arr(s) = models(s).estimateRegressionParameters()
       | }
Serialization works, but the serialized closure for the function is a
whopping 400MB.

If I pass in an array of the same length that was created in a different
way, the size of the serialized closure is only about 90K, which seems
about right.

Naively, it seems like somehow the history of how the array was created is
having an effect on what happens to it inside a closure.

Is this expected behavior?  Can anybody explain what's going on?

any insight very appreciated,
Sandy


Re: closure serialization behavior driving me crazy

2014-11-10 Thread Matei Zaharia
Hey Sandy,

Try using the -Dsun.io.serialization.extendedDebugInfo=true flag on the JVM to 
print the contents of the objects. In addition, something else that helps is to 
do the following:

{
  val _arr = arr
  models.map(... _arr ...)
}

Basically, copy the global variable into a local one. Then the field access 
from outside (from the interpreter-generated object that contains the line 
initializing arr) is no longer required, and the closure no longer has a 
reference to that.
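
Regarding the extendedDebugInfo flag above: it only needs to reach the driver
JVM, since that is where the closure serializer runs.  One way to pass it,
assuming the closures are being built in spark-shell:

  ./bin/spark-shell --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true"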

I'm really confused as to why Array.ofDim would be so large by the way, but are
you sure you haven't flipped around the dimensions (e.g. it should be 5 x
1800)? A 5-double array will consume more than 5*8 bytes (probably something
like 60 at least), and an array of those will still have a pointer to each one,
so I'd expect that many of them to be more than 80 KB (which is very close to
1867*5*8).
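
For what it's worth, the serialized size of an array of that shape can be
measured directly with plain Java serialization (a quick sketch, independent
of Spark):

  import java.io.{ByteArrayOutputStream, ObjectOutputStream}

  val arr   = Array.ofDim[Double](1867, 5)   // 1867 * 5 * 8 = 74,680 bytes of raw doubles
  val bytes = new ByteArrayOutputStream()
  val oos   = new ObjectOutputStream(bytes)
  oos.writeObject(arr)
  oos.close()
  println(bytes.size())                      // raw doubles plus per-array overhead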

Matei

 On Nov 10, 2014, at 1:01 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
 
 I'm experiencing some strange behavior with closure serialization that is
 totally mind-boggling to me.  It appears that two arrays of equal size take
 up vastly different amounts of space inside closures if they're generated in
 different ways.
 
 The basic flow of my app is to run a bunch of tiny regressions using Commons 
 Math's OLSMultipleLinearRegression and then reference a 2D array of the 
 results from a transformation.  I was running into OOMEs and
 NotSerializableExceptions and tried to get closer to the root issue by 
 calling the closure serializer directly.
   scala> val arr = models.map(_.estimateRegressionParameters()).toArray
 
 The result array is 1867 x 5.  Serialized, it comes to about 80K bytes, which
 seems about right:
   scala> SparkEnv.get.closureSerializer.newInstance().serialize(arr)
   res17: java.nio.ByteBuffer = java.nio.HeapByteBuffer[pos=0 lim=80027 
 cap=80027]
 
 If I reference it from a simple function:
   scala> def func(x: Long) = arr.length
   scala> SparkEnv.get.closureSerializer.newInstance().serialize(func)
 I get a NotSerializableException.
 
 If I take pains to create the array using a loop:
   scala> val arr = Array.ofDim[Double](1867, 5)
   scala> for (s <- 0 until models.length) {
        | arr(s) = models(s).estimateRegressionParameters()
        | }
 Serialization works, but the serialized closure for the function is a 
 whopping 400MB.
 
 If I pass in an array of the same length that was created in a different way, 
 the size of the serialized closure is only about 90K, which seems about right.
 
 Naively, it seems like somehow the history of how the array was created is 
 having an effect on what happens to it inside a closure.
 
 Is this expected behavior?  Can anybody explain what's going on?
 
 any insight very appreciated,
 Sandy

