This is a good summary - have you thought of publishing it at a URL for others
to refer to?

> On 18 Mar 2016, at 07:05, Lars Albertsson <la...@mapflat.com> wrote:
> 
> I would recommend against writing unit tests for Spark programs, and
> instead focus on integration tests of jobs or pipelines of several
> jobs. You can still use a unit test framework to execute them. Perhaps
> this is what you meant.
> 
> You can use any of the popular unit test frameworks to drive your
> tests, e.g. JUnit, Scalatest, Specs2. I prefer Scalatest, since it
> gives you choice of TDD vs BDD, and it is also well integrated with
> IntelliJ.
> 
> I would also recommend against using testing frameworks tied to a
> processing technology, such as Spark Testing Base. Although it does
> seem well crafted, and makes it easy to get started with testing,
> there are drawbacks:
> 
> 1. I/O routines are not tested. Bundled test frameworks typically do
> not materialise datasets on storage, but pass them directly in memory.
> (I have not verified this for Spark Testing Base, but it looks so.)
> I/O routines are therefore not exercised, and they often hide bugs,
> e.g. related to serialisation.
> 
> 2. You create a strong coupling between processing technology and your
> tests. If you decide to change processing technology (which can happen
> soon in this fast-paced world...), you need to rewrite your tests.
> During such a migration, the tests therefore cannot detect bugs
> introduced by the migration, nor help you migrate quickly.
> 
> I recommend that you instead materialise input datasets on local disk,
> run your Spark job, which writes output datasets to local disk, read
> output from disk, and verify the results. You can still use Spark
> routines to read and write input and output datasets. A Spark context
> is expensive to create, so for speed, I would recommend reusing the
> Spark context between input generation, running the job, and reading
> output.
> 
> This is easy to set up, so you don't need a dedicated framework for
> it. Just put your common boilerplate in a shared test trait or base
> class.
> 
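To make that concrete, here is a rough sketch of what such a shared trait and
a file-based job test could look like (Scala with ScalaTest; the word-count
job, paths, and names are made up for illustration):

  import java.nio.file.Files

  import org.apache.spark.{SparkConf, SparkContext}
  import org.scalatest.{BeforeAndAfterAll, FlatSpec, Matchers, Suite}

  // Shared boilerplate: one local SparkContext reused for input generation,
  // running the job, and reading the output back for verification.
  trait SharedSparkTest extends BeforeAndAfterAll { self: Suite =>
    @transient protected var sc: SparkContext = _

    override def beforeAll(): Unit = {
      super.beforeAll()
      sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
    }

    override def afterAll(): Unit = {
      if (sc != null) sc.stop()
      super.afterAll()
    }

    // Fresh scratch directory per call, so datasets really go through disk I/O.
    protected def tmpPath(): String = Files.createTempDirectory("spark-test-").toString
  }

  // Trivial stand-in for the job under test: reads lines, writes word counts.
  object WordCountJob {
    def run(sc: SparkContext, input: String, output: String): Unit =
      sc.textFile(input)
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
        .map { case (word, count) => s"$word,$count" }
        .saveAsTextFile(output)
  }

  class WordCountJobSpec extends FlatSpec with Matchers with SharedSparkTest {
    "the job" should "count words, going via files on local disk" in {
      val dir = tmpPath()
      sc.parallelize(Seq("a b a")).saveAsTextFile(dir + "/input")
      WordCountJob.run(sc, dir + "/input", dir + "/output")
      sc.textFile(dir + "/output").collect() should contain("a,2")
    }
  }
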
> In the future, when you want to replace your Spark job with something
> shinier, you can still use the old tests, and only replace the part
> that runs your job, giving you some protection from regression bugs.
> 
> 
> Testing Spark Streaming applications is a different beast, and you can
> probably not reuse much from your batch testing.
> 
> For testing streaming applications, I recommend that you run your
> application inside a unit test framework, e.g, Scalatest, and have the
> test setup create a fixture that includes your input and output
> components. For example, if your streaming application consumes from
> Kafka and updates tables in Cassandra, spin up single node instances
> of Kafka and Cassandra on your local machine, and connect your
> application to them. Then feed input to a Kafka topic, and wait for
> the result to appear in Cassandra.
> 
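A rough sketch of what such a fixture-based test might look like, assuming
Kafka (localhost:9092) and Cassandra (localhost:9042) are already running
locally and the streaming application under test has been started against
them; the topic, keyspace, and table names below are invented:

  import java.util.Properties

  import com.datastax.driver.core.Cluster
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
  import org.scalatest.concurrent.Eventually
  import org.scalatest.time.{Millis, Seconds, Span}
  import org.scalatest.{FlatSpec, Matchers}

  class StreamingPipelineSpec extends FlatSpec with Matchers with Eventually {

    "the pipeline" should "turn Kafka events into Cassandra rows" in {
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect("test_ks")
      try {
        // Feed input to the (made-up) "events" topic...
        producer.send(new ProducerRecord[String, String]("events", "user1", """{"clicks": 1}"""))
        producer.flush()
        // ...and wait for the result to appear in the (made-up) output table.
        eventually(timeout(Span(30, Seconds)), interval(Span(50, Millis))) {
          val rows = session.execute(
            "SELECT * FROM clicks_by_user WHERE user = 'user1'").all()
          rows.size should be > 0
        }
      } finally {
        session.close()
        producer.close()
      }
    }
  }
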
> With this setup, your application still runs in Scalatest, the tests
> run without custom setup in maven/sbt/gradle, and you can easily run
> and debug inside IntelliJ.
> 
> Docker is suitable for spinning up external components. If you use
> Kafka, the Docker image spotify/kafka is useful, since it bundles
> Zookeeper.
> 
> When waiting for output to appear, don't sleep for a long time and
> then check, since it will slow down your tests. Instead enter a loop
> where you poll for the results and sleep for a few milliseconds in
> between, with a long timeout (~30s) before the test fails with a
> timeout.

org.scalatest.concurrent.Eventually is your friend there

      eventually(stdTimeout, stdInterval) {
        listRestAPIApplications(connector, webUI, true) should contain(expectedAppId)
      }

It has good exponential backoff, giving fast initial success without using too
much CPU later, and it is simple to use.
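
(stdTimeout and stdInterval above are locally defined config values, not
ScalaTest built-ins; with ScalaTest's Span they can be built along these
lines, the exact numbers being illustrative:)

  import org.scalatest.concurrent.Eventually
  import org.scalatest.time.{Millis, Seconds, Span}

  trait StdPatience { self: Eventually =>
    // Illustrative values; tune to your environment.
    val stdTimeout  = timeout(Span(30, Seconds))
    val stdInterval = interval(Span(50, Millis))
  }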

If it has weaknesses in my tests, they are:

1. It retries on all exceptions, rather than just failed assertions. If there's
a bug in the test code, it manifests as a timeout. (I think I could play with
Suite.anExceptionThatShouldCauseAnAbort() here.)
2. Its timeout action is simply to rethrow the fault; I like to execute a
closure to grab more diagnostics (see the sketch after this list).
3. It doesn't support a fail-fast exception which your code can raise to
indicate that the desired state is never going to be reached, so that the
test can fail immediately. Here a new exception and another entry in
anExceptionThatShouldCauseAnAbort() may be the answer. I should sit down
and play with that some more.
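
For (2), a rough sketch of the kind of wrapper I mean; eventuallyWithDiagnostics
is a made-up name, not ScalaTest API:

  object EventuallyDiagnostics {
    import org.scalatest.concurrent.Eventually._
    import org.scalatest.exceptions.TestFailedDueToTimeoutException

    // Run the block under eventually, but dump extra diagnostics (logs, table
    // contents, ...) before rethrowing if it times out.
    def eventuallyWithDiagnostics[T](diagnostics: => String)(block: => T): T =
      try eventually(block)
      catch {
        case e: TestFailedDueToTimeoutException =>
          System.err.println("eventually timed out; diagnostics:\n" + diagnostics)
          throw e
      }
  }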


> 
> This poll-and-sleep strategy makes tests quick in successful cases, yet
> still robust to occasional delays. The strategy does not work if you want
> to test for absence, e.g. to ensure that a particular message is filtered.
> You can work around it by adding another message afterwards and polling
> for its effect before testing for absence of the first. Be aware, however,
> that messages can be processed out of order in Spark Streaming, depending
> on partitioning.
> 
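Continuing the earlier Kafka/Cassandra sketch (same producer, session, and
made-up topic/table names), the sentinel trick could look roughly like this:

  // The first message is the one the application is expected to filter out.
  producer.send(new ProducerRecord[String, String]("events", "filtered-user", """{"clicks": 1}"""))
  producer.send(new ProducerRecord[String, String]("events", "sentinel-user", """{"clicks": 1}"""))
  producer.flush()

  // Wait until the later (sentinel) message has taken effect...
  eventually(timeout(Span(30, Seconds)), interval(Span(50, Millis))) {
    session.execute(
      "SELECT * FROM clicks_by_user WHERE user = 'sentinel-user'").all().size should be > 0
  }
  // ...then assert that the filtered message left no trace. With several Kafka
  // partitions the two messages may be processed out of order, as noted above.
  session.execute(
    "SELECT * FROM clicks_by_user WHERE user = 'filtered-user'").all().size shouldBe 0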
> 
> I have tested Spark applications with both strategies described above,
> and it is straightforward to set up. Let me know if you want
> clarifications or assistance.
> 
> Regards,
> 
> 
> 
> Lars Albertsson
> Data engineering consultant
> www.mapflat.com
> +46 70 7687109
> 
> 
> On Wed, Mar 2, 2016 at 6:54 PM, SRK <swethakasire...@gmail.com> wrote:
>> Hi,
>> 
>> What is a good unit testing framework for Spark batch/streaming jobs? I have
>> core spark, spark sql with dataframes and streaming api getting used. Any
>> good framework to cover unit tests for these APIs?
>> 
>> Thanks!
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-framework-for-Spark-Jobs-tp26380.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
> 

