Re: Spark unit testing best practices

2014-05-16 Thread Andras Nemeth
Thanks for the answers!

On a concrete example, here is what I did to test my (wrong :) ) hypothesis
before writing my email:
import org.apache.spark

class SomethingNotSerializable {
  def process(a: Int): Int = 2 * a
}

object NonSerializableClosure extends App {
  val sc = new spark.SparkContext(
    "local",
    "SerTest",
    "/home/xandrew/spark-0.9.0-incubating",
    Seq("target/scala-2.10/sparktests_2.10-0.1-SNAPSHOT.jar"))
  val sns = new SomethingNotSerializable
  println(sc.parallelize(Seq(1, 2, 3))
    .map(sns.process(_))
    .reduce(_ + _))
}

This program correctly prints 12. If I change "local" to point to my Spark
master, the code fails on the worker with a NullPointerException at the
.map(sns.process(_)) line.

But I have to say that my original assumption that this is a serialization
issue was wrong: adding "extends Serializable" to my class does _not_
solve the problem in non-local mode. This seems to be something more
convoluted: the sns reference in my closure is probably not captured by
value; instead I guess it's a by-name reference to
NonSerializableClosure.sns. I'm a bit surprised that this results in a
NullPointerException rather than some error when trying to run the
constructor of this object on the worker. Maybe something to do with the
magic of App.
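
If the App guess is right, a variant like the following should sidestep the
NPE by capturing a local val from a plain main() instead of a field of the
App object. Just a sketch, untested, with the class made Serializable since
the local val now really does travel over the wire:

import org.apache.spark

class SomethingSerializable extends Serializable {
  def process(a: Int): Int = 2 * a
}

object SerializableClosure {
  def main(args: Array[String]) {
    val sc = new spark.SparkContext(
      "local",
      "SerTest",
      "/home/xandrew/spark-0.9.0-incubating",
      Seq("target/scala-2.10/sparktests_2.10-0.1-SNAPSHOT.jar"))
    // Local val: the closure captures the value itself rather than a
    // field of a DelayedInit-initialized App object.
    val sns = new SomethingSerializable
    println(sc.parallelize(Seq(1, 2, 3))
      .map(sns.process(_))
      .reduce(_ + _))
  }
}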

Anyway, while this is indeed an example of an error that doesn't manifest
in local mode, it turns out to be convoluted enough that we won't worry
about it for now: we'll use "local" in tests, and I'll ask again if we see
actual production vs. unit test problems.
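
For the record, the shape of local-mode harness we'll go with is roughly
the following, a minimal sketch assuming ScalaTest and reusing the
SomethingNotSerializable toy class from above (suite and app names are
made up):

import org.apache.spark.SparkContext
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class ProcessSuite extends FunSuite with BeforeAndAfterAll {
  private var sc: SparkContext = _

  override def beforeAll() {
    // "local" master: fast and in-process, but (per the above) not a
    // faithful simulation of a real cluster's serialization path.
    sc = new SparkContext("local", "unit-tests")
  }

  override def afterAll() {
    sc.stop()
  }

  test("process doubles and sums") {
    val sns = new SomethingNotSerializable
    assert(sc.parallelize(Seq(1, 2, 3)).map(sns.process(_)).reduce(_ + _) === 12)
  }
}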


On using local-cluster: this does sound like exactly what I had in mind,
but it doesn't seem to work for application developers. It seems to assume
you are running within a Spark build (it fails while looking for the file
bin/compute-classpath.sh). So maybe that's a reason it's not documented...

Cheers,
Andras





On Wed, May 14, 2014 at 7:58 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

 Local mode does serDe, so it should expose serialization problems.


 On Wed, May 14, 2014 at 10:53 AM, Philip Ogren <philip.og...@oracle.com> wrote:

 Have you actually found this to be true?  I have found Spark local mode
 to be quite good about blowing up if there is something non-serializable
 and so my unit tests have been great for detecting this.  I have never seen
 something that worked in local mode that didn't work on the cluster because
 of different serialization requirements between the two.  Perhaps it is
 different when using Kryo...



 On 05/14/2014 04:34 AM, Andras Nemeth wrote:

 E.g. if I accidentally use a closure which has something
 non-serializable in it, then my test will happily succeed in local mode but
 go down in flames on a real cluster.






Re: Spark unit testing best practices

2014-05-16 Thread Nan Zhu
+1, at least with current code  

just watch the log printed by DAGScheduler…  

--  
Nan Zhu


On Wednesday, May 14, 2014 at 1:58 PM, Mark Hamstra wrote:

 serDe  



Spark unit testing best practices

2014-05-15 Thread Andras Nemeth
Hi,

Spark's local mode is great for creating simple unit tests for our Spark
logic. The disadvantage, however, is that certain types of problems are
never exposed in local mode, because things never need to be put on the wire.

E.g. if I accidentally use a closure which has something non-serializable
in it, then my test will happily succeed in local mode but go down in
flames on a real cluster.

Another example is Kryo: I'd like to use setRegistrationRequired(true) to
avoid any hidden performance problems due to forgotten registration. And of
course I'd like things to fail in tests. But it won't happen because we
never actually need to serialize the RDDs in local mode.
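
Concretely, what I have in mind is something like this registrator,
sketched against the 0.9-era API (the record type and package name are
placeholders):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Placeholder record type standing in for our real data classes.
case class MyRecord(id: Long, name: String)

class StrictRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // Fail fast: serializing any unregistered class now throws instead
    // of silently falling back to writing fully qualified class names.
    kryo.setRegistrationRequired(true)
    kryo.register(classOf[MyRecord])
  }
}

object KryoConf {
  // Wire the registrator into the Spark configuration.
  def make(): SparkConf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "mypackage.StrictRegistrator")
}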

So, is there a good solution to the above problems? Is there some
local-like mode which simulates serialization as well? Or is there an easy
way to start up *from code* a standalone Spark cluster on the machine
running the unit test?

Thanks,
Andras


Re: Spark unit testing best practices

2014-05-14 Thread Andrew Ash
There's an undocumented mode that looks like it simulates a cluster:

SparkContext.scala:

// Regular expression for simulating a Spark cluster of [N, cores, memory] locally
val LOCAL_CLUSTER_REGEX =
  """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r

Can you try running your tests with a master URL of local-cluster[2,2,512] to
see if that does serialization?
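
Something like the following is what I'd try; just a sketch, with an
arbitrary app name and workload:

import org.apache.spark.SparkContext

object LocalClusterSmokeTest {
  def main(args: Array[String]) {
    // [N, cores, memory]: 2 workers, 2 cores each, 512 MB each.
    val sc = new SparkContext("local-cluster[2,2,512]", "SerTest")
    // Any closure run here has to survive a real trip over the wire.
    println(sc.parallelize(1 to 10).map(_ * 2).reduce(_ + _))
    sc.stop()
  }
}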


On Wed, May 14, 2014 at 3:34 AM, Andras Nemeth <andras.nem...@lynxanalytics.com> wrote:

 Hi,

 Spark's local mode is great for creating simple unit tests for our Spark
 logic. The disadvantage, however, is that certain types of problems are
 never exposed in local mode, because things never need to be put on the wire.

 E.g. if I accidentally use a closure which has something non-serializable
 in it, then my test will happily succeed in local mode but go down in
 flames on a real cluster.

 Another example is Kryo: I'd like to use setRegistrationRequired(true) to
 avoid any hidden performance problems due to forgotten registration. And of
 course I'd like things to fail in tests. But it won't happen because we
 never actually need to serialize the RDDs in local mode.

 So, is there a good solution to the above problems? Is there some
 local-like mode which simulates serialization as well? Or is there an easy
 way to start up *from code* a standalone Spark cluster on the machine
 running the unit test?

 Thanks,
 Andras




Re: Spark unit testing best practices

2014-05-14 Thread Philip Ogren
Have you actually found this to be true?  I have found Spark local mode 
to be quite good about blowing up if there is something non-serializable 
and so my unit tests have been great for detecting this.  I have never 
seen something that worked in local mode that didn't work on the cluster 
because of different serialization requirements between the two.  
Perhaps it is different when using Kryo...



On 05/14/2014 04:34 AM, Andras Nemeth wrote:
E.g. if I accidentally use a closure which has something 
non-serializable in it, then my test will happily succeed in local 
mode but go down in flames on a real cluster.