Hello!

Thank you very much for your helpful answer and for the very good work done in spark-testing-base. I managed to write unit tests with the spark-testing-base library by following the article you provided, and I also took inspiration from
https://github.com/holdenk/spark-testing-base/blob/master/src/test/1.3/java/com/holdenkarau/spark/testing/SampleJavaRDDTest.java

I have some concerns about how to compare an RDD that comes from a DataFrame with one that comes from the jsc().parallelize method. My test workflow is as follows:

1. Read the data from a Parquet file as a DataFrame.
2. Convert the DataFrame with toJavaRDD().
3. Perform some mapping on the JavaRDD.
4. Check whether the resulting mapped RDD is equal to the expected one (retrieved from a text file).

I ran the above test with the following code snippet:

    JavaRDD<MyCustomer> expected = jsc().parallelize(input_from_text_file);
    SparkSession spark = SparkSession.builder().getOrCreate();
    JavaRDD<Row> input = spark.read().parquet("src/test/resources/test_data.parquet").toJavaRDD();
    JavaRDD<MyCustomer> result = MyDriver.convertToMyCustomerData(input);
    JavaRDDComparisons.assertRDDEquals(expected, result);

The test failed, even though the data is the same. While debugging, I observed that the data that came from the DataFrame did not have the same order as the data that came from jsc().parallelize(text_file). So I suppose the issue comes from the fact that the SparkSession and jsc() don't share the same SparkContext (there is a warning about this when running the program).

Therefore I arrived at a solution: use the same jsc for both the expected and the result RDDs. With this solution the assertion succeeded as expected.

    List<Row> df = spark.read().parquet("src/test/resources/test_data.parquet").toJavaRDD().collect();
    JavaRDD<Row> input = jsc().parallelize(df);
    JavaRDD<MyCustomer> result = MyDriver.convertToMyCustomerData(input);
    JavaRDDComparisons.assertRDDEquals(expected, result);

My questions are:

1. What is the best way to compare RDDs when one is built from a DataFrame and the other is obtained via jsc().parallelize()?
2. Is the above solution a suitable one?
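
If element order really is the only difference, one order-independent alternative would be to collect both RDDs and compare the resulting lists as multisets. The following is only a minimal sketch under assumptions: UnorderedCompare and its methods are hypothetical helpers, not part of spark-testing-base, and the approach assumes that MyCustomer implements equals() and hashCode() and that the test data is small enough to collect to the driver.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of spark-testing-base.
final class UnorderedCompare {

    // Counts how many times each element occurs.
    // Relies on the element type's equals() and hashCode().
    static <T> Map<T, Integer> counts(List<T> items) {
        Map<T, Integer> result = new HashMap<>();
        for (T item : items) {
            result.merge(item, 1, Integer::sum);
        }
        return result;
    }

    // True when both lists hold the same elements with the same
    // multiplicities, regardless of the order in which they appear.
    static <T> boolean sameElementsIgnoringOrder(List<T> expected, List<T> actual) {
        return expected.size() == actual.size()
                && counts(expected).equals(counts(actual));
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("alice", "bob", "bob");
        List<String> reordered = Arrays.asList("bob", "alice", "bob");
        List<String> different = Arrays.asList("alice", "alice", "bob");

        System.out.println(sameElementsIgnoringOrder(expected, reordered)); // true
        System.out.println(sameElementsIgnoringOrder(expected, different)); // false
    }
}
```

In the test above, this would be called with expected.collect() and result.collect(), wrapped in a regular JUnit assertTrue.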

I look forward to your answers.

Regards,
Florin

On Wed, May 30, 2018 at 3:11 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> So Jessie has an excellent blog post on how to use it with Java
> applications -
> http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/
>
> On Wed, May 30, 2018 at 4:14 AM Spico Florin <spicoflo...@gmail.com>
> wrote:
>
>> Hello!
>> I'm also looking for unit testing spark Java application. I've seen the
>> great work done in spark-testing-base but it seemed to me that I could
>> not use for Spark Java applications.
>> Only spark scala applications are supported?
>> Thanks.
>> Regards,
>> Florin
>>
>> On Wed, May 23, 2018 at 8:07 AM, umargeek <umarfarooq.tech...@gmail.com>
>> wrote:
>>
>>> Hi Steve,
>>>
>>> you can try out pytest-spark plugin if your writing programs using
>>> pyspark
>>> ,please find below link for reference.
>>>
>>> https://github.com/malexer/pytest-spark
>>>
>>> Thanks,
>>> Umar
>>
> --
> Twitter: https://twitter.com/holdenkarau