I would recommend against writing unit tests for Spark programs, and
instead focusing on integration tests of jobs or pipelines of several
jobs. You can still use a unit test framework to execute them. Perhaps
this is what you meant.

You can use any of the popular unit test frameworks to drive your
tests, e.g. JUnit, Scalatest, Specs2. I prefer Scalatest, since it
gives you a choice of TDD vs. BDD styles, and it is also well
integrated with IntelliJ.

I would also recommend against using testing frameworks tied to a
processing technology, such as Spark Testing Base. Although it does
seem well crafted, and makes it easy to get started with testing,
there are drawbacks:

1. I/O routines are not tested. Bundled test frameworks typically do
not materialise datasets on storage, but pass them directly in memory.
(I have not verified this for Spark Testing Base, but it appears to be
the case.) I/O routines are therefore not exercised, and they often
hide bugs, e.g. related to serialisation.

2. You create a strong coupling between processing technology and your
tests. If you decide to change processing technology (which can happen
sooner than you expect in this fast paced world...), you need to
rewrite your tests. Tests tied to the old technology therefore cannot
detect bugs introduced during the migration, nor help you migrate
quickly.

I recommend that you instead materialise input datasets on local disk,
run your Spark job, which writes output datasets to local disk, read
the output back from disk, and verify the results. You can still use
Spark routines to read and write the input and output datasets. A
Spark context is expensive to create, so for speed, I would recommend
reusing the Spark context between input generation, running the job,
and reading the output.

This is easy to set up, so you don't need a dedicated framework for
it. Just put your common boilerplate in a shared test trait or base
class.
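
For example, a minimal shared trait could look something like this
(the trait name and the one-session-per-suite choice are my own
assumptions; adjust to your needs):

import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, Suite}

// Minimal shared boilerplate: one local Spark session per test suite,
// stopped when the suite finishes.
trait SharedSparkSession extends BeforeAndAfterAll { this: Suite =>

  @transient protected lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName(getClass.getSimpleName)
    .getOrCreate()

  override def afterAll(): Unit = {
    spark.stop()
    super.afterAll()
  }
}

// Usage: class MyJobSpec extends FlatSpec with Matchers with SharedSparkSession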

In the future, when you want to replace your Spark job with something
shinier, you can still use the old tests, and only replace the part
that runs your job, giving you some protection from regression bugs.


Testing Spark Streaming applications is a different beast, and you
probably cannot reuse much from your batch testing.

For testing streaming applications, I recommend that you run your
application inside a unit test framework, e.g. Scalatest, and have the
test setup create a fixture that includes your input and output
components. For example, if your streaming application consumes from
Kafka and updates tables in Cassandra, spin up single node instances
of Kafka and Cassandra on your local machine, and connect your
application to them. Then feed input to a Kafka topic, and wait for
the result to appear in Cassandra.
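
A rough sketch of such a test, assuming Kafka on localhost:9092 and
Cassandra on localhost (e.g. started via Docker, see below), and
assuming the application under test is started in beforeAll and
consumes an "events" topic. The topic, keyspace, table, and column
names are invented for illustration, and the pollUntil helper is
sketched further down:

import java.util.Properties

import com.datastax.driver.core.{Cluster, Session}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.scalatest.{BeforeAndAfterAll, FlatSpec}

class StreamingJobSpec extends FlatSpec with BeforeAndAfterAll {

  private var producer: KafkaProducer[String, String] = _
  private var cluster: Cluster = _
  private var cassandra: Session = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    cluster = Cluster.builder().addContactPoint("localhost").build()
    cassandra = cluster.connect()
    // Start the streaming application under test here, pointed at the same
    // Kafka broker and Cassandra node. How depends on your application.
  }

  override def afterAll(): Unit = {
    producer.close()
    cluster.close()
    super.afterAll()
  }

  "the streaming job" should "write processed events to Cassandra" in {
    // Feed input to the Kafka topic the application consumes.
    producer.send(new ProducerRecord("events", "user1",
      """{"user": "user1", "action": "click"}"""))
    producer.flush()

    // Wait for the result to appear in Cassandra (poll, don't sleep once).
    TestPolling.pollUntil() {
      cassandra.execute(
        "SELECT * FROM test_ks.enriched WHERE user = 'user1'").one() != null
    }
  }
}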

With this setup, your application still runs in Scalatest, the tests
run without custom setup under Maven/sbt/Gradle, and you can easily
run and debug them inside IntelliJ.

Docker is suitable for spinning up external components. If you use
Kafka, the Docker image spotify/kafka is useful, since it bundles
Zookeeper.

When waiting for output to appear, don't sleep for a long time and
then check, since it will slow down your tests. Instead, enter a loop
where you poll for the results and sleep for a few milliseconds in
between, with a generous timeout (~30 s) after which the test fails.
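
A minimal sketch of such a polling helper (the name, defaults, and
failure behaviour are my own choices; Scalatest's Eventually trait
offers similar ready-made functionality):

// Poll-and-sleep helper: evaluate `condition` every `intervalMs`
// milliseconds until it holds, or fail after `timeoutMs` milliseconds.
object TestPolling {
  def pollUntil(timeoutMs: Long = 30000L, intervalMs: Long = 50L)
               (condition: => Boolean): Unit = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (!condition) {
      if (System.currentTimeMillis() > deadline) {
        throw new AssertionError(s"Condition not met within $timeoutMs ms")
      }
      Thread.sleep(intervalMs)
    }
  }
}

// Usage: TestPolling.pollUntil() { queryReturnsExpectedRow() }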

This poll-and-sleep strategy makes tests quick in the successful case,
yet robust to occasional delays. The strategy does not work if you
want to test for absence, e.g. to ensure that a particular message is
filtered out. You can work around that by sending another message
afterwards and polling for its effect before testing for absence of
the first. Be aware, however, that messages can be processed out of
order in Spark Streaming, depending on partitioning.
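
For example, continuing the made-up StreamingJobSpec above, a test for
a filtered message could look roughly like this:

  // Inside the StreamingJobSpec class sketched earlier:
  it should "filter out messages from blocked users" in {
    // The message that should be dropped, followed by a sentinel that
    // should pass the filter.
    producer.send(new ProducerRecord("events", "blocked-user",
      """{"user": "blocked-user", "action": "click"}"""))
    producer.send(new ProducerRecord("events", "sentinel-user",
      """{"user": "sentinel-user", "action": "click"}"""))
    producer.flush()

    // Wait until the sentinel has been processed...
    TestPolling.pollUntil() {
      cassandra.execute(
        "SELECT * FROM test_ks.enriched WHERE user = 'sentinel-user'").one() != null
    }
    // ...then check that the blocked message never produced a row. Note the
    // ordering caveat above: this assumes both messages hit the same partition.
    assert(cassandra.execute(
      "SELECT * FROM test_ks.enriched WHERE user = 'blocked-user'").one() == null)
  }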


I have tested Spark applications with both strategies described above,
and it is straightforward to set up. Let me know if you want
clarifications or assistance.

Regards,



Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109


On Wed, Mar 2, 2016 at 6:54 PM, SRK <swethakasire...@gmail.com> wrote:
> Hi,
>
> What is a good unit testing framework for Spark batch/streaming jobs? I have
> core spark, spark sql with dataframes and streaming api getting used. Any
> good framework to cover unit tests for these APIs?
>
> Thanks!
>
