Agreed with the statement in quotes below, whether one wants to write unit tests or not: it is good practice to structure code that way. But I think the more painful and tedious task is mocking/emulating all the moving parts such as the Spark workers/master, HDFS, the input source stream, and so on. I wish there were something really simple. Perhaps the simplest thing is to write integration tests that also exercise the transformations/business logic. That way I can spin up a small cluster, run my tests, and bring the cluster down when I am done. Granted, if the cluster isn't available then I can't run the tests, but some node should be available even to run a single process. I somehow feel we may be doing too much work just to fit the archaic definition of a unit test.
"Basically you abstract your transformations to take in a dataframe and return one, then you assert on the returned df" - this.

On Tue, Mar 7, 2017 at 11:14 AM, Michael Armbrust <mich...@databricks.com> wrote:
>> Basically you abstract your transformations to take in a dataframe and
>> return one, then you assert on the returned df
>
> +1 to this suggestion. This is why we wanted streaming and batch
> dataframes to share the same API.
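A minimal sketch of the pattern being +1'd: the transformation is a plain function from dataframe to dataframe, so the test just builds a tiny input, calls the function, and asserts on the result. To keep the example self-contained (no Spark installation required), plain Python lists of dicts stand in for DataFrames; in a real Spark test the function would take and return a (py)spark DataFrame. The function name `add_total` and the column names are made up for illustration.

```python
def add_total(rows):
    """Transformation under test: takes a 'dataframe', returns a new one.

    In real Spark code this would be e.g. df.withColumn(...) and the
    argument/return type would be a DataFrame.
    """
    return [{**row, "total": row["price"] * row["qty"]} for row in rows]


# The test never touches a cluster, a stream, or HDFS: it builds a small
# input 'df', runs the transformation, and asserts on the returned 'df'.
input_df = [{"price": 2.0, "qty": 3}, {"price": 5.0, "qty": 1}]
output_df = add_total(input_df)

assert output_df[0]["total"] == 6.0
assert output_df[1]["total"] == 5.0
print([row["total"] for row in output_df])  # [6.0, 5.0]
```

Because the business logic is isolated behind a dataframe-in/dataframe-out function, the same function can be reused in both the streaming and batch pipelines, which is exactly why sharing one API across them matters here.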