I would recommend against writing unit tests for Spark programs, and instead focusing on integration tests of jobs or pipelines of several jobs. You can still use a unit test framework to execute them; perhaps this is what you meant.
You can use any of the popular unit test frameworks to drive your tests, e.g. JUnit, Scalatest, Specs2. I prefer Scalatest, since it gives you a choice of TDD vs BDD styles, and it is also well integrated with IntelliJ.

I would also recommend against using testing frameworks tied to a particular processing technology, such as Spark Testing Base. Although it does seem well crafted, and makes it easy to get started with testing, there are drawbacks:

1. I/O routines are not tested. Bundled test frameworks typically do not materialise datasets on storage, but pass them directly in memory. (I have not verified this for Spark Testing Base, but it looks so.) I/O routines are therefore not exercised, and they often hide bugs, e.g. related to serialisation.

2. You create a strong coupling between processing technology and your tests. If you decide to change processing technology (which can happen soon in this fast-paced world...), you need to rewrite your tests. During a migration, the tests therefore cannot detect bugs introduced by the migration, nor help you migrate quickly.

I recommend that you instead materialise input datasets on local disk, run your Spark job (which writes output datasets to local disk), read the output from disk, and verify the results. You can still use Spark routines to read and write the input and output datasets. A Spark context is expensive to create, so for speed, I would recommend reusing the Spark context between input generation, running the job, and reading the output. This is easy to set up, so you don't need a dedicated framework for it. Just put your common boilerplate in a shared test trait or base class.

In the future, when you want to replace your Spark job with something shinier, you can still use the old tests, replacing only the part that runs your job, which gives you some protection against regression bugs.

Testing Spark Streaming applications is a different beast, and you can probably not reuse much from your batch testing.
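To illustrate the shape of such a test, here is a minimal sketch in Python (for brevity; in a Scala codebase the same pattern fits a Scalatest trait or base class). run_job() is a stand-in for the real Spark job, and the plain JSON-lines helpers stand in for Spark's readers and writers; all names here are illustrative:

```python
# Sketch of the disk-materialisation pattern: write input to local disk,
# run the job, read the output back, and verify. In a real suite, the
# job would be a Spark job and the I/O would go through Spark routines.
import json
import os
import tempfile
import unittest


def write_jsonl(path, records):
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")


def read_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]


def run_job(input_path, output_path):
    # Stand-in for the job under test: read input from disk,
    # transform it, write output to disk.
    records = read_jsonl(input_path)
    out = [{"id": r["id"], "total": r["price"] * r["qty"]} for r in records]
    write_jsonl(output_path, out)


class JobIntegrationTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Create expensive shared resources once per suite. With Spark,
        # this is where the shared SparkContext/SparkSession would be
        # built and reused across input generation, the job, and output
        # verification, as recommended above.
        cls.workdir = tempfile.mkdtemp()

    def test_job_computes_totals(self):
        input_path = os.path.join(self.workdir, "input.jsonl")
        output_path = os.path.join(self.workdir, "output.jsonl")
        write_jsonl(input_path, [{"id": 1, "price": 2.0, "qty": 3}])
        run_job(input_path, output_path)
        self.assertEqual(read_jsonl(output_path), [{"id": 1, "total": 6.0}])
```

When you later swap out the processing technology, only run_job() changes; the fixtures and assertions survive.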
For testing streaming applications, I recommend that you run your application inside a unit test framework, e.g. Scalatest, and have the test setup create a fixture that includes your input and output components. For example, if your streaming application consumes from Kafka and updates tables in Cassandra, spin up single-node instances of Kafka and Cassandra on your local machine, and connect your application to them. Then feed input to a Kafka topic, and wait for the result to appear in Cassandra.

With this setup, your application still runs in Scalatest, the tests run without custom setup in maven/sbt/gradle, and you can easily run and debug them inside IntelliJ. Docker is suitable for spinning up the external components. If you use Kafka, the Docker image spotify/kafka is useful, since it bundles Zookeeper.

When waiting for output to appear, don't sleep for a long time and then check, since that slows down your tests. Instead, enter a loop where you poll for the results and sleep for a few milliseconds in between, with a long timeout (~30s) before the test fails. This poll-and-sleep strategy keeps tests quick in the successful case, yet robust to occasional delays.

The strategy does not work if you want to test for absence, e.g. to ensure that a particular message is filtered out. You can work around this by sending another message afterwards, and polling for its effect before testing for absence of the first. Be aware, however, that messages can be processed out of order in Spark Streaming, depending on partitioning.

I have tested Spark applications with both strategies described above, and they are straightforward to set up. Let me know if you want clarifications or assistance.

Regards,

Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109

On Wed, Mar 2, 2016 at 6:54 PM, SRK <swethakasire...@gmail.com> wrote:
> Hi,
>
> What is a good unit testing framework for Spark batch/streaming jobs?
> I have core spark, spark sql with dataframes and streaming api getting
> used. Any good framework to cover unit tests for these APIs?
>
> Thanks!
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-framework-for-Spark-Jobs-tp26380.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
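P.S. The poll-and-sleep wait recommended above for streaming tests can be sketched as a small helper (Python for brevity; the names are illustrative, and a real test would poll the output store, e.g. a Cassandra table, rather than an arbitrary predicate):

```python
# Sketch of the poll-and-sleep strategy: poll a condition frequently
# with short sleeps in between, and fail only after a long overall
# timeout (~30s in practice). This keeps the happy path fast while
# remaining robust to occasional delays.
import time


def await_condition(predicate, timeout_s=30.0, poll_interval_s=0.01):
    """Return as soon as predicate() is true; raise on overall timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return
        time.sleep(poll_interval_s)  # short sleep, not one long wait
    raise TimeoutError("condition not met within %.1fs" % timeout_s)
```

A test would then feed a message to Kafka and call something like await_condition(lambda: row_present_in_cassandra(key)), where row_present_in_cassandra is a hypothetical query helper.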