These may be of some use:

https://github.com/mkuthan/example-spark
http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/
https://github.com/holdenk/spark-testing-base

On Wed, May 18, 2016 at 2:14 PM, swetha kasireddy <swethakasire...@gmail.com> wrote:

> Hi Lars,
>
> Do you have any examples for the methods that you described for Spark
> batch and Streaming?
>
> Thanks!
>
> On Wed, Mar 30, 2016 at 2:41 AM, Lars Albertsson <la...@mapflat.com> wrote:
>
>> Thanks!
>>
>> It is on my backlog to write a couple of blog posts on the topic, and
>> eventually some example code, but I am currently busy with clients.
>>
>> Thanks for the pointer to Eventually - I was unaware. Fast exit on
>> exception would be a useful addition, indeed.
>>
>> Lars Albertsson
>> Data engineering consultant
>> www.mapflat.com
>> +46 70 7687109
>>
>> On Mon, Mar 28, 2016 at 2:00 PM, Steve Loughran <ste...@hortonworks.com> wrote:
>> > This is a good summary - have you thought of publishing it at the end
>> > of a URL for others to refer to?
>> >
>> >> On 18 Mar 2016, at 07:05, Lars Albertsson <la...@mapflat.com> wrote:
>> >>
>> >> I would recommend against writing unit tests for Spark programs, and
>> >> instead focus on integration tests of jobs or pipelines of several
>> >> jobs. You can still use a unit test framework to execute them. Perhaps
>> >> this is what you meant.
>> >>
>> >> You can use any of the popular unit test frameworks to drive your
>> >> tests, e.g. JUnit, Scalatest, Specs2. I prefer Scalatest, since it
>> >> gives you a choice of TDD vs BDD styles, and it is also well
>> >> integrated with IntelliJ.
>> >>
>> >> I would also recommend against using testing frameworks tied to a
>> >> processing technology, such as Spark Testing Base. Although it does
>> >> seem well crafted, and makes it easy to get started with testing,
>> >> there are drawbacks:
>> >>
>> >> 1. I/O routines are not tested. Bundled test frameworks typically do
>> >> not materialise datasets on storage, but pass them directly in memory.
>> >> (I have not verified this for Spark Testing Base, but it appears to be
>> >> the case.) I/O routines are therefore not exercised, and they often
>> >> hide bugs, e.g. related to serialisation.
>> >>
>> >> 2. You create a strong coupling between processing technology and your
>> >> tests. If you decide to change processing technology (which can happen
>> >> soon in this fast-paced world...), you need to rewrite your tests.
>> >> During a migration, such tests can therefore neither detect bugs
>> >> introduced by the migration nor help you migrate quickly.
>> >>
>> >> I recommend that you instead materialise input datasets on local disk,
>> >> run your Spark job, which writes output datasets to local disk, read
>> >> output from disk, and verify the results. You can still use Spark
>> >> routines to read and write input and output datasets. A Spark context
>> >> is expensive to create, so for speed, I would recommend reusing the
>> >> Spark context between input generation, running the job, and reading
>> >> output.
>> >>
>> >> This is easy to set up, so you don't need a dedicated framework for
>> >> it. Just put your common boilerplate in a shared test trait or base
>> >> class.
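>> >>
>> >> A minimal sketch of such a shared trait, assuming ScalaTest (the
>> >> runWithIO helper and its shape are my own invention, not a standard
>> >> API):
>> >>
>> >> import java.nio.file.Files
>> >>
>> >> import org.apache.spark.{SparkConf, SparkContext}
>> >> import org.scalatest.{BeforeAndAfterAll, Suite}
>> >>
>> >> trait SparkJobSpec extends BeforeAndAfterAll { this: Suite =>
>> >>   // Reuse one context for input generation, the job run, and
>> >>   // output verification, since contexts are expensive to create.
>> >>   lazy val sc: SparkContext =
>> >>     new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
>> >>
>> >>   override def afterAll(): Unit = sc.stop()
>> >>
>> >>   // Materialise input on local disk, run the job, read output back.
>> >>   def runWithIO(input: Seq[String])(job: (String, String) => Unit): Seq[String] = {
>> >>     val inDir = Files.createTempDirectory("test").resolve("input").toString
>> >>     val outDir = Files.createTempDirectory("test").resolve("output").toString
>> >>     sc.parallelize(input).saveAsTextFile(inDir)
>> >>     job(inDir, outDir)
>> >>     sc.textFile(outDir).collect().toSeq
>> >>   }
>> >> }
>> >>
>> >> A test then calls runWithIO(Seq(...)) { (in, out) => MyJob.run(sc, in,
>> >> out) } and asserts on the returned lines (MyJob is hypothetical).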
>> >>
>> >> In the future, when you want to replace your Spark job with something
>> >> shinier, you can still use the old tests, and only replace the part
>> >> that runs your job, giving you some protection from regression bugs.
>> >>
>> >>
>> >> Testing Spark Streaming applications is a different beast, and you
>> >> probably cannot reuse much from your batch testing.
>> >>
>> >> For testing streaming applications, I recommend that you run your
>> >> application inside a unit test framework, e.g. Scalatest, and have the
>> >> test setup create a fixture that includes your input and output
>> >> components. For example, if your streaming application consumes from
>> >> Kafka and updates tables in Cassandra, spin up single node instances
>> >> of Kafka and Cassandra on your local machine, and connect your
>> >> application to them. Then feed input to a Kafka topic, and wait for
>> >> the result to appear in Cassandra.
>> >>
>> >> With this setup, your application still runs in Scalatest, the tests
>> >> run without custom setup in maven/sbt/gradle, and you can easily run
>> >> and debug inside IntelliJ.
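>> >>
>> >> As a hedged sketch (the endpoints, topic name, and the commented-out
>> >> startStreamingJob entry point are assumptions for illustration):
>> >>
>> >> import java.util.Properties
>> >>
>> >> import com.datastax.driver.core.Cluster
>> >> import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
>> >> import org.scalatest.{FlatSpec, Matchers}
>> >>
>> >> class StreamingJobSpec extends FlatSpec with Matchers {
>> >>   "the job" should "propagate events from Kafka to Cassandra" in {
>> >>     // startStreamingJob("localhost:9092", "localhost")  // your entry point
>> >>
>> >>     val props = new Properties()
>> >>     props.put("bootstrap.servers", "localhost:9092")
>> >>     props.put("key.serializer",
>> >>       "org.apache.kafka.common.serialization.StringSerializer")
>> >>     props.put("value.serializer",
>> >>       "org.apache.kafka.common.serialization.StringSerializer")
>> >>     val producer = new KafkaProducer[String, String](props)
>> >>     producer.send(new ProducerRecord("events", "id1", "payload"))
>> >>     producer.flush()
>> >>
>> >>     val session = Cluster.builder().addContactPoint("localhost").build().connect()
>> >>     // Poll Cassandra via session for the row, as described below,
>> >>     // then assert on its contents.
>> >>   }
>> >> }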
>> >>
>> >> Docker is suitable for spinning up external components. If you use
>> >> Kafka, the Docker image spotify/kafka is useful, since it bundles
>> >> Zookeeper.
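>> >>
>> >> For example, something like this should start a single-node Kafka with
>> >> embedded Zookeeper (taken from the spotify/kafka documentation as I
>> >> recall it; verify the ports and variables for your version):
>> >>
>> >> docker run -d -p 2181:2181 -p 9092:9092 \
>> >>     --env ADVERTISED_HOST=localhost --env ADVERTISED_PORT=9092 \
>> >>     spotify/kafka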
>> >>
>> >> When waiting for output to appear, don't sleep for a long time and
>> >> then check, since it will slow down your tests. Instead enter a loop
>> >> where you poll for the results and sleep for a few milliseconds in
>> >> between, with a long timeout (~30s) before the test fails with a
>> >> timeout.
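>> >>
>> >> A minimal sketch of such a loop (awaitResult and its query parameter
>> >> are names I made up):
>> >>
>> >> // Poll until query returns a value, sleeping briefly between attempts.
>> >> def awaitResult[T](timeoutMs: Long = 30000, intervalMs: Long = 50)
>> >>                   (query: () => Option[T]): T = {
>> >>   val deadline = System.currentTimeMillis() + timeoutMs
>> >>   while (System.currentTimeMillis() < deadline) {
>> >>     query() match {
>> >>       case Some(result) => return result
>> >>       case None         => Thread.sleep(intervalMs)
>> >>     }
>> >>   }
>> >>   throw new AssertionError(s"no result within $timeoutMs ms")
>> >> }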
>> >
>> > org.scalatest.concurrent.Eventually is your friend there
>> >
>> > eventually(stdTimeout, stdInterval) {
>> >   listRestAPIApplications(connector, webUI, true) should contain(expectedAppId)
>> > }
>> >
>> > It has good exponential backoff, for fast initial success without
>> > using too much CPU later, and it is simple to use.
>> >
>> > If it has weaknesses in my tests, they are
>> >
>> > 1. It will retry on all exceptions, not just failed assertions. If
>> > there's a bug in the test code, it manifests as a timeout. (I think I
>> > could play with Suite.anExceptionThatShouldCauseAnAbort() here.)
>> > 2. Its timeout action is simply to rethrow the fault; I like to exec a
>> > closure to grab more diagnostics (see the sketch after this list).
>> > 3. It doesn't support a fail-fast exception which your code can raise
>> > to indicate that the desired state is never going to be reached, so the
>> > test should fail immediately. Here a new exception and another entry in
>> > anExceptionThatShouldCauseAnAbort() may be the answer. I should sit
>> > down and play with that some more.
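>> >
>> > For point 2, something like this hedged sketch (eventuallyWithDiagnostics
>> > is my own name, not part of ScalaTest):
>> >
>> > import org.scalatest.concurrent.Eventually._
>> > import org.scalatest.exceptions.TestFailedDueToTimeoutException
>> >
>> > // Run a diagnostics closure before propagating a timeout failure.
>> > def eventuallyWithDiagnostics[T](diagnose: () => Unit)(block: => T): T =
>> >   try eventually(block)
>> >   catch {
>> >     case e: TestFailedDueToTimeoutException =>
>> >       diagnose()  // e.g. dump REST state, logs, thread stacks
>> >       throw e
>> >   }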
>> >
>> >
>> >>
>> >> This poll-and-sleep strategy makes tests both quick in successful
>> >> cases and robust to occasional delays. The strategy does not work if
>> >> you want to test for absence, e.g. to ensure that a particular message
>> >> is filtered. You can work around that by sending another message
>> >> afterwards and polling for its effect before testing for absence of
>> >> the first; a sketch follows below. Be aware, however, that messages
>> >> can be processed out of order in Spark Streaming, depending on
>> >> partitioning.
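>> >>
>> >> For example (sendMessage and resultFor are placeholders, and
>> >> awaitResult is the loop sketched above):
>> >>
>> >> sendMessage(filteredMsg)  // should be dropped by the job
>> >> sendMessage(sentinelMsg)  // known to pass through
>> >> awaitResult() { () => resultFor(sentinelMsg) }  // sentinel arrived...
>> >> resultFor(filteredMsg) shouldBe None  // ...so absence is meaningful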
>> >>
>> >>
>> >> I have tested Spark applications with both strategies described above,
>> >> and it is straightforward to set up. Let me know if you want
>> >> clarifications or assistance.
>> >>
>> >> Regards,
>> >>
>> >>
>> >>
>> >> Lars Albertsson
>> >> Data engineering consultant
>> >> www.mapflat.com
>> >> +46 70 7687109
>> >>
>> >>
>> >> On Wed, Mar 2, 2016 at 6:54 PM, SRK <swethakasire...@gmail.com> wrote:
>> >>> Hi,
>> >>>
>> >>> What is a good unit testing framework for Spark batch/streaming
>> >>> jobs? I have core Spark, Spark SQL with DataFrames, and the Streaming
>> >>> API in use. Is there a good framework to cover unit tests for these
>> >>> APIs?
>> >>>
>> >>> Thanks!
>> >>>
>> >>>
>> >>>
>> >
>>
>
>
