Not that I can share, unfortunately. It is on my backlog to create a repository with examples, but I am currently a bit overloaded, so don't hold your breath. :-/
If you want to be notified when that happens, please follow me on Twitter or Google+. See the web site below for links.

Regards,

Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109

On May 18, 2016 20:14, "swetha kasireddy" <swethakasire...@gmail.com> wrote:
> Hi Lars,
>
> Do you have any examples for the methods that you described for Spark batch and streaming?
>
> Thanks!
>
> On Wed, Mar 30, 2016 at 2:41 AM, Lars Albertsson <la...@mapflat.com> wrote:
>
>> Thanks!
>>
>> It is on my backlog to write a couple of blog posts on the topic, and eventually some example code, but I am currently busy with clients.
>>
>> Thanks for the pointer to Eventually - I was unaware. Fast exit on exception would be a useful addition, indeed.
>>
>> Lars Albertsson
>> Data engineering consultant
>> www.mapflat.com
>> +46 70 7687109
>>
>> On Mon, Mar 28, 2016 at 2:00 PM, Steve Loughran <ste...@hortonworks.com> wrote:
>> > This is a good summary. Have you thought of publishing it at the end of a URL for others to refer to?
>> >
>> >> On 18 Mar 2016, at 07:05, Lars Albertsson <la...@mapflat.com> wrote:
>> >>
>> >> I would recommend against writing unit tests for Spark programs, and instead focus on integration tests of jobs or pipelines of several jobs. You can still use a unit test framework to execute them. Perhaps this is what you meant.
>> >>
>> >> You can use any of the popular unit test frameworks to drive your tests, e.g. JUnit, Scalatest, Specs2. I prefer Scalatest, since it gives you a choice of TDD vs BDD style, and it is also well integrated with IntelliJ.
>> >>
>> >> I would also recommend against using testing frameworks tied to a processing technology, such as Spark Testing Base. Although it does seem well crafted, and makes it easy to get started with testing, there are drawbacks:
>> >>
>> >> 1. I/O routines are not tested. Bundled test frameworks typically do not materialise datasets on storage, but pass them directly in memory. (I have not verified this for Spark Testing Base, but it appears to be the case.) I/O routines are therefore not exercised, and they often hide bugs, e.g. related to serialisation.
>> >>
>> >> 2. You create a strong coupling between the processing technology and your tests. If you decide to change processing technology (which can happen sooner than you expect in this fast-paced field...), you need to rewrite your tests. During a migration, the tests can therefore neither detect bugs introduced by the migration nor help you migrate quickly.
>> >>
>> >> I recommend that you instead materialise input datasets on local disk, run your Spark job, which writes output datasets to local disk, read the output from disk, and verify the results. You can still use Spark routines to read and write the input and output datasets. A Spark context is expensive to create, so for speed, I would recommend reusing the Spark context between input generation, running the job, and reading the output.
>> >>
>> >> This is easy to set up, so you don't need a dedicated framework for it. Just put your common boilerplate in a shared test trait or base class.
>> >>
>> >> In the future, when you want to replace your Spark job with something shinier, you can still use the old tests, and only replace the part that runs your job, giving you some protection from regression bugs.
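
For illustration, a minimal sketch of such an integration test could look roughly like this in Scalatest (3.0-style syntax); SharedSparkSession, WordCountJob, and the temp-file layout are invented for the example rather than taken from any framework:

    import java.nio.file.Files

    import org.apache.spark.sql.SparkSession
    import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers, Suite}

    // Shared boilerplate: one SparkSession per suite, since it is expensive to create.
    trait SharedSparkSession extends BeforeAndAfterAll { self: Suite =>
      lazy val spark: SparkSession =
        SparkSession.builder().master("local[2]").appName("integration-test").getOrCreate()

      override def afterAll(): Unit = {
        try spark.stop() finally super.afterAll()
      }
    }

    // Hypothetical production job under test: reads text from disk, writes word counts to disk.
    object WordCountJob {
      def run(spark: SparkSession, inputPath: String, outputPath: String): Unit = {
        import spark.implicits._
        spark.read.textFile(inputPath)
          .flatMap(_.split("\\s+"))
          .groupBy("value").count()
          .write.parquet(outputPath)
      }
    }

    class WordCountJobSpec extends FlatSpec with Matchers with SharedSparkSession {

      "the word count job" should "produce correct counts, exercised through disk I/O" in {
        val workDir = Files.createTempDirectory("wordcount-test").toString
        val inputPath = s"$workDir/input"
        val outputPath = s"$workDir/output"
        import spark.implicits._

        // 1. Materialise the input dataset on local disk.
        Seq("a b", "b").toDS().write.text(inputPath)

        // 2. Run the job, which reads its input from disk and writes its output to disk.
        WordCountJob.run(spark, inputPath, outputPath)

        // 3. Read the output back from disk and verify the result.
        val counts = spark.read.parquet(outputPath)
          .collect()
          .map(row => row.getString(0) -> row.getLong(1))
          .toMap
        counts shouldBe Map("a" -> 1L, "b" -> 2L)
      }
    }

Because the job both reads and writes through the filesystem, the serialisation and I/O paths are exercised, which is the point of the approach described above; the shared trait keeps the SparkSession creation cost to once per suite.
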
>> >>
>> >> Testing Spark Streaming applications is a different beast, and you can probably not reuse much from your batch testing.
>> >>
>> >> For testing streaming applications, I recommend that you run your application inside a unit test framework, e.g. Scalatest, and have the test setup create a fixture that includes your input and output components. For example, if your streaming application consumes from Kafka and updates tables in Cassandra, spin up single-node instances of Kafka and Cassandra on your local machine, and connect your application to them. Then feed input to a Kafka topic, and wait for the result to appear in Cassandra.
>> >>
>> >> With this setup, your application still runs in Scalatest, the tests run without custom setup in maven/sbt/gradle, and you can easily run and debug inside IntelliJ.
>> >>
>> >> Docker is suitable for spinning up external components. If you use Kafka, the Docker image spotify/kafka is useful, since it bundles Zookeeper.
>> >>
>> >> When waiting for output to appear, don't sleep for a long time and then check, since that will slow down your tests. Instead, enter a loop where you poll for the results and sleep for a few milliseconds in between, with a long timeout (~30 s) before the test fails.
>> >
>> > org.scalatest.concurrent.Eventually is your friend there:
>> >
>> >   eventually(stdTimeout, stdInterval) {
>> >     listRestAPIApplications(connector, webUI, true) should contain(expectedAppId)
>> >   }
>> >
>> > It has good exponential backoff, for fast initial success without using too much CPU later, and is simple to use.
>> >
>> > If it has weaknesses in my tests, they are:
>> >
>> > 1. It retries on all exceptions, rather than only on assertion failures, so a bug in the test code manifests as a timeout. (I think I could play with Suite.anExceptionThatShouldCauseAnAbort() here.)
>> > 2. Its timeout action is simply to rethrow the fault; I like to execute a closure to grab more diagnostics.
>> > 3. It doesn't support a fail-fast exception that your code can raise to indicate that the desired state will never be reached and the test should therefore fail immediately. Here a new exception and another entry in anExceptionThatShouldCauseAnAbort() may be the answer. I should sit down and play with that some more.
>> >
>> >> This poll-and-sleep strategy makes tests both quick in successful cases and robust to occasional delays. The strategy does not work if you want to test for absence, e.g. ensure that a particular message is filtered. You can work around that by adding another message afterwards and polling for its effect before testing for the absence of the first. Be aware, however, that messages can be processed out of order in Spark Streaming, depending on partitioning.
>> >>
>> >> I have tested Spark applications with both strategies described above, and it is straightforward to set up. Let me know if you want clarifications or assistance.
>> >>
>> >> Regards,
>> >>
>> >> Lars Albertsson
>> >> Data engineering consultant
>> >> www.mapflat.com
>> >> +46 70 7687109
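
To illustrate the polling strategy discussed above, a rough sketch of such a streaming test using org.scalatest.concurrent.Eventually might look like this. It assumes Kafka and Cassandra are already running locally (for example via Docker, as suggested), and sendToKafka/readFromCassandra are hypothetical fixture helpers, not real library calls:

    import org.scalatest.concurrent.Eventually
    import org.scalatest.time.{Millis, Seconds, Span}
    import org.scalatest.{FlatSpec, Matchers}

    class StreamingPipelineSpec extends FlatSpec with Matchers with Eventually {

      "the streaming pipeline" should "propagate a Kafka message to Cassandra" in {
        // Feed input to the topic the application consumes from.
        sendToKafka(topic = "events", message = """{"id": "42", "value": 1}""")

        // Poll for the result: long timeout, short interval, so the test is
        // fast when it succeeds but tolerant of occasional delays.
        eventually(timeout(Span(30, Seconds)), interval(Span(50, Millis))) {
          readFromCassandra(table = "results", key = "42") shouldBe Some(1)
        }
      }

      // Stubs standing in for real fixture code; replace with a Kafka producer
      // and a Cassandra driver session connected to the local containers.
      private def sendToKafka(topic: String, message: String): Unit = ???
      private def readFromCassandra(table: String, key: String): Option[Int] = ???
    }

For absence checks, the workaround described above still applies: send a follow-up marker message, wait for its effect, and only then assert that the first message left no trace.
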
>> >>
>> >> On Wed, Mar 2, 2016 at 6:54 PM, SRK <swethakasire...@gmail.com> wrote:
>> >>> Hi,
>> >>>
>> >>> What is a good unit testing framework for Spark batch/streaming jobs? I have core Spark, Spark SQL with dataframes, and the streaming API in use. Any good framework to cover unit tests for these APIs?
>> >>>
>> >>> Thanks!
>> >>>
>> >>> --
>> >>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-framework-for-Spark-Jobs-tp26380.html
>> >>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >>> For additional commands, e-mail: user-h...@spark.apache.org