Re: Spark Unit Testing
Hi Emre, thanks for the help, will have a look. Cheers!

On Tue, Apr 21, 2015 at 1:46 PM, Emre Sevinc emre.sev...@gmail.com wrote:

Hello James,

Did you check the following resources?

- https://github.com/apache/spark/tree/master/streaming/src/test/java/org/apache/spark/streaming
- http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs

--
Emre Sevinç
http://www.bigindustries.be/

On Tue, Apr 21, 2015 at 1:26 PM, James King jakwebin...@gmail.com wrote:

I'm trying to write some unit tests for my Spark code. I need to pass a JavaPairDStream<String, String> to my Spark class. Is there a way to create a JavaPairDStream using the Java API? Also, is there a good resource that covers an approach (or approaches) to unit testing Spark using Java?

Regards,
jk
Re: Spark Unit Testing
Hello James,

Did you check the following resources?

- https://github.com/apache/spark/tree/master/streaming/src/test/java/org/apache/spark/streaming
- http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs

--
Emre Sevinç
http://www.bigindustries.be/

On Tue, Apr 21, 2015 at 1:26 PM, James King jakwebin...@gmail.com wrote:

I'm trying to write some unit tests for my Spark code. I need to pass a JavaPairDStream<String, String> to my Spark class. Is there a way to create a JavaPairDStream using the Java API? Also, is there a good resource that covers an approach (or approaches) to unit testing Spark using Java?

Regards,
jk
Spark Unit Testing
I'm trying to write some unit tests for my Spark code. I need to pass a JavaPairDStream<String, String> to my Spark class. Is there a way to create a JavaPairDStream using the Java API? Also, is there a good resource that covers an approach (or approaches) to unit testing Spark using Java?

Regards,
jk
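One way to build such a stream in a test is to queue up in-memory RDDs with JavaStreamingContext.queueStream (which feeds one queued RDD into the stream per batch interval) and convert the resulting JavaDStream with mapToPair. A minimal sketch follows; the class name, sample data, batch interval, and timeout are all illustrative:

    import java.util.Arrays;
    import java.util.LinkedList;
    import java.util.Queue;

    import scala.Tuple2;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.PairFunction;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class PairDStreamFixture {
      public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("pair-dstream-test");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // queueStream consumes one RDD from the queue per batch interval.
        Queue<JavaRDD<String>> queue = new LinkedList<>();
        queue.add(jssc.sparkContext().parallelize(Arrays.asList("k1:v1", "k2:v2")));

        JavaDStream<String> lines = jssc.queueStream(queue);

        // Turn each "key:value" line into a pair; the resulting
        // JavaPairDStream<String, String> can be handed to the class under test.
        JavaPairDStream<String, String> pairs = lines.mapToPair(
            new PairFunction<String, String, String>() {
              @Override
              public Tuple2<String, String> call(String line) {
                String[] parts = line.split(":", 2);
                return new Tuple2<>(parts[0], parts[1]);
              }
            });

        pairs.print();
        jssc.start();
        jssc.awaitTerminationOrTimeout(3000);
        jssc.stop();
      }
    }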
Re: Spark unit testing best practices
Thanks for the answers!

As a concrete example, here is what I did to test my (wrong :) ) hypothesis before writing my email:

    class SomethingNotSerializable {
      def process(a: Int): Int = 2 * a
    }

    object NonSerializableClosure extends App {
      val sc = new spark.SparkContext(
        "local",
        "SerTest",
        "/home/xandrew/spark-0.9.0-incubating",
        Seq("target/scala-2.10/sparktests_2.10-0.1-SNAPSHOT.jar"))
      val sns = new SomethingNotSerializable
      println(sc.parallelize(Seq(1, 2, 3))
        .map(sns.process(_))
        .reduce(_ + _))
    }

This program prints 12 correctly. If I change "local" to point to my Spark master, the code fails on the worker with a NullPointerException in the line .map(sns.process(_)).

But I have to say that my original assumption that this is a serialization issue was wrong, as adding extends Serializable to my class does _not_ solve the problem in non-local mode. This seems to be something more convoluted: the sns reference in my closure is probably not stored by value; instead I guess it's a by-name reference to NonSerializableClosure.sns. I'm a bit surprised that this results in a NullPointerException instead of some error when trying to run the constructor of this object on the worker. Maybe something to do with the magic of App.

Anyway, while this is indeed an example of an error that doesn't manifest in local mode, it turns out to be convoluted enough that we won't worry about it for now: we'll use local in tests, and I'll ask again if we see some actual prod-vs-unit-test problems.

On using local-cluster: this does sound like exactly what I had in mind, but it doesn't seem to work for application developers. It seems to assume you are running within a Spark build (it fails while looking for the file bin/compute-classpath.sh). So maybe that's a reason it's not documented...

Cheers,
Andras

On Wed, May 14, 2014 at 7:58 PM, Mark Hamstra m...@clearstorydata.com wrote:

Local mode does serDe, so it should expose serialization problems.

On Wed, May 14, 2014 at 10:53 AM, Philip Ogren philip.og...@oracle.com wrote:

Have you actually found this to be true? I have found Spark local mode to be quite good about blowing up if there is something non-serializable, and so my unit tests have been great for detecting this. I have never seen something that worked in local mode that didn't work on the cluster because of different serialization requirements between the two. Perhaps it is different when using Kryo.

On 05/14/2014 04:34 AM, Andras Nemeth wrote:

E.g. if I accidentally use a closure which has something non-serializable in it, then my test will happily succeed in local mode but go down in flames on a real cluster.
Re: Spark unit testing best practices
+1. At least with the current code, just watch the log printed by the DAGScheduler…

-- Nan Zhu

On Wednesday, May 14, 2014 at 1:58 PM, Mark Hamstra wrote:

Local mode does serDe, so it should expose serialization problems.
Spark unit testing best practices
Hi,

Spark's local mode is great for creating simple unit tests for our Spark logic. The disadvantage, however, is that certain types of problems are never exposed in local mode, because things never need to be put on the wire.

E.g. if I accidentally use a closure which has something non-serializable in it, then my test will happily succeed in local mode but go down in flames on a real cluster.

Another example is Kryo: I'd like to use setRegistrationRequired(true) to avoid any hidden performance problems due to forgotten registrations, and of course I'd like things to fail in tests. But that won't happen, because we never actually need to serialize the RDDs in local mode.

So, is there some good solution to the above problems? Is there some local-like mode which simulates serialization as well? Or is there an easy way to start up, *from code*, a standalone Spark cluster on the machine running the unit test?

Thanks,
Andras
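On the Kryo half of the question, a minimal sketch of one way to wire setRegistrationRequired(true) into a test configuration: a custom KryoRegistrator, set through the spark.kryo.registrator property. The class name and registered types here are hypothetical, and newer Spark versions also expose this directly as the spark.kryo.registrationRequired setting. As the question itself notes, the check only bites once data is actually serialized, i.e. on a real or simulated cluster rather than in plain local mode:

    import com.esotericsoftware.kryo.Kryo;

    import org.apache.spark.SparkConf;
    import org.apache.spark.serializer.KryoRegistrator;

    // Hypothetical registrator: makes Kryo fail fast whenever a class that was
    // never registered below reaches the serializer.
    public class StrictRegistrator implements KryoRegistrator {
      @Override
      public void registerClasses(Kryo kryo) {
        kryo.setRegistrationRequired(true);
        kryo.register(int[].class);      // register whatever the job actually ships
        kryo.register(String[].class);
      }

      // Wiring it into a test configuration:
      public static SparkConf testConf() {
        return new SparkConf()
            .setMaster("local")
            .setAppName("kryo-registration-test")
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .set("spark.kryo.registrator", StrictRegistrator.class.getName());
      }
    }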
Re: Spark unit testing best practices
There's an undocumented mode that looks like it simulates a cluster. From SparkContext.scala:

    // Regular expression for simulating a Spark cluster of [N, cores, memory] locally
    val LOCAL_CLUSTER_REGEX = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r

Can you try running your tests with a master URL of local-cluster[2,2,512] to see if that does serialization?

On Wed, May 14, 2014 at 3:34 AM, Andras Nemeth andras.nem...@lynxanalytics.com wrote:

Hi,

Spark's local mode is great for creating simple unit tests for our Spark logic. The disadvantage, however, is that certain types of problems are never exposed in local mode, because things never need to be put on the wire. E.g. if I accidentally use a closure which has something non-serializable in it, then my test will happily succeed in local mode but go down in flames on a real cluster. Another example is Kryo: I'd like to use setRegistrationRequired(true) to avoid any hidden performance problems due to forgotten registrations, and of course I'd like things to fail in tests. But that won't happen, because we never actually need to serialize the RDDs in local mode.

So, is there some good solution to the above problems? Is there some local-like mode which simulates serialization as well? Or is there an easy way to start up, *from code*, a standalone Spark cluster on the machine running the unit test?

Thanks,
Andras
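A minimal sketch of what pointing a test at that master URL might look like; the caveat raised elsewhere in the thread applies, since the mode appears to assume a full Spark build on disk and fails if it can't find bin/compute-classpath.sh:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LocalClusterSmokeTest {
      public static void main(String[] args) {
        // local-cluster[2,2,512]: 2 workers, 2 cores each, 512 MB each,
        // matching the three capture groups in LOCAL_CLUSTER_REGEX above.
        // Unlike "local", tasks run in separate worker JVMs, so closures
        // and shuffled data really do get serialized.
        SparkConf conf = new SparkConf()
            .setMaster("local-cluster[2,2,512]")
            .setAppName("local-cluster-smoke-test");
        JavaSparkContext sc = new JavaSparkContext(conf);

        int sum = sc.parallelize(Arrays.asList(1, 2, 3)).reduce((a, b) -> a + b);
        System.out.println(sum);  // 6, if the cluster came up and serialization worked

        sc.stop();
      }
    }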
Re: Spark unit testing best practices
Have you actually found this to be true? I have found Spark local mode to be quite good about blowing up if there is something non-serializable, and so my unit tests have been great for detecting this. I have never seen something that worked in local mode that didn't work on the cluster because of different serialization requirements between the two. Perhaps it is different when using Kryo.

On 05/14/2014 04:34 AM, Andras Nemeth wrote:

E.g. if I accidentally use a closure which has something non-serializable in it, then my test will happily succeed in local mode but go down in flames on a real cluster.
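This matches Mark's "local mode does serDe" point: Spark attempts to serialize task closures even under a local master, so the straightforward case of capturing a non-serializable object fails fast. A minimal sketch, with hypothetical class names, of the kind of test local mode does catch:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;

    public class LocalModeSerializationTest {
      // Deliberately NOT java.io.Serializable.
      static class NonSerializableHelper {
        int process(int a) { return 2 * a; }
      }

      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("ser-test");
        JavaSparkContext sc = new JavaSparkContext(conf);
        final NonSerializableHelper helper = new NonSerializableHelper();

        // The anonymous function captures `helper`, so serializing the task
        // closure fails with "Task not serializable" -- in local mode too.
        sc.parallelize(Arrays.asList(1, 2, 3))
            .map(new Function<Integer, Integer>() {
              @Override
              public Integer call(Integer x) {
                return helper.process(x);
              }
            })
            .collect();

        sc.stop();
      }
    }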