Re: TDD in Spark
Thanks for all the suggestions. Very helpful.

On 17 January 2017 at 22:04, Lars Albertsson wrote:
> My advice, short version: [...]
Re: TDD in Spark
My advice, short version:
* Start by testing one job per test.
* Use Scalatest or a standard framework.
* Generate input datasets with Spark routines, write to local file.
* Run job with local master.
* Read output with Spark routines, validate only the fields you care about for the test case at hand.
* Focus on building a functional regression test suite with small test cases before testing with large input datasets. The former improves productivity more.

Avoid:
* Test frameworks coupled to your processing technology - they will make it difficult to switch.
* Spending much effort on small unit tests. Internal interfaces in Spark tend to be volatile, and testing against them results in high maintenance costs.
* Input files checked in to version control. They are difficult to maintain. Generate input files with code instead.
* Expected output files checked in to VC. Same reason. Validate selected fields instead.

For a longer answer, please search for my previous posts to the user list, or watch this presentation: https://vimeo.com/192429554

Slides at http://www.slideshare.net/lallea/test-strategies-for-data-processing-pipelines-67244458

Regards,

Lars Albertsson
Data engineering consultant
www.mapflat.com
https://twitter.com/lalleal
+46 70 7687109
Calendar: https://goo.gl/6FBtlS, https://freebusy.io/la...@mapflat.com

On Sun, Jan 15, 2017 at 7:14 PM, A Shaikh wrote:
> What's the most popular testing approach for Spark apps? I am looking for
> something along the lines of TDD.
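The short-version steps in the advice above (generate input with code, run one job per test on local files, validate only the fields the test cares about) could be sketched roughly as follows. This is a minimal illustration, not anything posted in the thread: it uses only the Python standard library so it stays independent of the processing technology, and `make_order`, `revenue_per_country_job`, the field names, and the JSON-lines file layout are all hypothetical. In a real suite the stand-in job function would instead launch the Spark job with a local master against the same input and output paths.

```python
import json
import tempfile
import unittest
from pathlib import Path


def write_records(path, records):
    """Write one JSON object per line to a local file."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


def read_records(path):
    """Read a JSON-lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def make_order(**overrides):
    """Hypothetical input generator: defaults for every field, so each
    test only spells out the fields it actually cares about."""
    order = {
        "order_id": "o-1",
        "country": "SE",
        "amount": 10.0,
        "ts": "2017-01-15T19:14:00Z",
    }
    order.update(overrides)
    return order


def revenue_per_country_job(input_path, output_path):
    """Stand-in for the job under test (assumption: file in, file out).
    A real test would start the Spark job with a local master here."""
    totals = {}
    for rec in read_records(input_path):
        totals[rec["country"]] = totals.get(rec["country"], 0.0) + rec["amount"]
    write_records(
        output_path,
        [
            # job_version is an extra output field the test deliberately ignores.
            {"country": country, "revenue": total, "job_version": "dev"}
            for country, total in sorted(totals.items())
        ],
    )


class RevenuePerCountryTest(unittest.TestCase):
    """One job per test case, small generated input, selected-field checks."""

    def test_amounts_are_summed_per_country(self):
        with tempfile.TemporaryDirectory() as tmp:
            input_path = Path(tmp) / "orders.jsonl"
            output_path = Path(tmp) / "revenue.jsonl"

            # Input is generated by code, not checked in to version control.
            write_records(
                input_path,
                [
                    make_order(order_id="o-1", country="SE", amount=10.0),
                    make_order(order_id="o-2", country="SE", amount=5.0),
                    make_order(order_id="o-3", country="NO", amount=7.5),
                ],
            )

            revenue_per_country_job(input_path, output_path)

            # Validate selected fields only, not a full expected-output file.
            revenue = {r["country"]: r["revenue"] for r in read_records(output_path)}
            self.assertEqual(revenue, {"SE": 15.0, "NO": 7.5})
```

If this were saved in a test module, it could be run with `python -m unittest`. Because the test only touches input and output files, the job behind it could be replaced by a different implementation without rewriting the test, which is the point of avoiding test frameworks coupled to the processing technology.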
Re: TDD in Spark
I've also written a small blog post that may help you out:
https://medium.com/@therevoltingx/test-driven-development-w-apache-spark-746082b44941#.ia6stbl6n

On Sun, Jan 15, 2017 at 12:13 PM, Silvio Fiorito wrote:
> You should check out Holden’s excellent spark-testing-base package:
> https://github.com/holdenk/spark-testing-base
Re: TDD in Spark
You should check out Holden’s excellent spark-testing-base package:
https://github.com/holdenk/spark-testing-base

From: A Shaikh
Date: Sunday, January 15, 2017 at 1:14 PM
To: User
Subject: TDD in Spark

What's the most popular testing approach for Spark apps? I am looking for something along the lines of TDD.