Agreed with the statement in quotes below: whether one wants to do unit
tests or not, it is good practice to write code that way. But I think the
more painful and tedious task is mocking/emulating all the nodes: Spark
workers/master, HDFS, the input source stream, and all that. I wish there
were something really simple. Perhaps the simplest thing is to write
integration tests, which also exercise the transformations/business
logic. That way I can spawn a small cluster (see the sketch below), run
my tests, and bring the cluster down when I am done. Sure, if the cluster
isn't available then I can't run the tests, but some node should be
available even to run a single process. I somehow feel we may be doing
too much work to fit the archaic definition of unit tests.
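
For what it's worth, here is a minimal sketch of what I mean by a small,
disposable cluster. Spark's local mode runs the driver and executors in a
single JVM, so there is no external master/workers/HDFS to mock. The
suite name and the transformation are made up for illustration:

    import org.apache.spark.sql.SparkSession
    import org.scalatest.{BeforeAndAfterAll, FunSuite}

    // Integration-style suite: "spawn" the cluster in-process,
    // run the tests, tear it down afterwards.
    class LocalClusterSuite extends FunSuite with BeforeAndAfterAll {

      // local[2] = driver plus two executor threads in one JVM
      private val spark: SparkSession = SparkSession.builder()
        .master("local[2]")
        .appName("integration-tests")
        .getOrCreate()

      import spark.implicits._

      // Bring the "cluster" down when done.
      override def afterAll(): Unit = spark.stop()

      test("business logic survives a round trip through Spark") {
        val input  = Seq(1, 2, 3).toDF("n")
        val result = input.selectExpr("n * 2 AS doubled")
        assert(result.as[Int].collect().sorted === Array(2, 4, 6))
      }
    }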

 "Basically you abstract your transformations to take in a dataframe and
return one, then you assert on the returned df " this
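
As a concrete (made-up) sketch of that pattern: keep the business logic
as a pure DataFrame => DataFrame function, so the same code path can be
tested against a hand-built batch dataframe and reused on the streaming
side:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Pure transformation: no idea whether the input came from
    // a file, Kafka, or a Seq built inside a test.
    object Transformations {
      def largeOrders(orders: DataFrame): DataFrame =
        orders.filter(col("amount") > 100)
    }

    // In a test (spark is the local SparkSession from above):
    import spark.implicits._
    val input  = Seq(("a", 50), ("b", 150)).toDF("id", "amount")
    val result = Transformations.largeOrders(input)
    assert(result.as[(String, Int)].collect() === Array(("b", 150)))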

On Tue, Mar 7, 2017 at 11:14 AM, Michael Armbrust <mich...@databricks.com>
wrote:

> Basically you abstract your transformations to take in a dataframe and
>> return one, then you assert on the returned df
>>
>
> +1 to this suggestion.  This is why we wanted streaming and batch
> dataframes to share the same API.
>
