Hello!
  Thank you very much for your helpful answer and for the great work done
on spark-testing-base. I managed to perform unit testing with the
spark-testing-base library by following the provided article and by taking
inspiration from

https://github.com/holdenk/spark-testing-base/blob/master/src/test/1.3/java/com/holdenkarau/spark/testing/SampleJavaRDDTest.java
.


I have some concerns regarding how to compare RDDs that come from a
DataFrame with RDDs that come from the jsc().parallelize method.

My test workflow is as follows:
1. Read the data from a Parquet file as a DataFrame
2. Convert the DataFrame with toJavaRDD()
3. Perform some mapping on the JavaRDD
4. Check whether the resulting mapped RDD is equal to the expected one
(retrieved from a text file)

I performed the above test with the following code snippet:

    JavaRDD<MyCustomer> expected = jsc().parallelize(input_from_text_file);
    SparkSession spark = SparkSession.builder().getOrCreate();

    JavaRDD<Row> input =
        spark.read().parquet("src/test/resources/test_data.parquet").toJavaRDD();

    JavaRDD<MyCustomer> result = MyDriver.convertToMyCustomerData(input);
    JavaRDDComparisons.assertRDDEquals(expected, result);

The above test failed, even though the data is the same. By debugging the
code, I observed that the data that came from the DataFrame didn't have
the same order as the data that came from jsc().parallelize(text_file).
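
One workaround I was considering is to make the comparison independent of
ordering by sorting both RDDs on a stable key before asserting. This is only
an untested sketch; it assumes MyCustomer has a getId() accessor (hypothetical
here) that can serve as the sort key:

    // Sort both RDDs by the same key so element order no longer matters
    // before handing them to the comparison.
    // getId() is a hypothetical accessor; any deterministic key would do.
    JavaRDD<MyCustomer> sortedExpected =
        expected.sortBy(c -> c.getId(), true, expected.getNumPartitions());
    JavaRDD<MyCustomer> sortedResult =
        result.sortBy(c -> c.getId(), true, result.getNumPartitions());

    JavaRDDComparisons.assertRDDEquals(sortedExpected, sortedResult);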

So I suppose the issue comes from the fact that the SparkSession and jsc()
don't share the same SparkContext (there is a warning about this when
running the program).

Therefore I came to the solution of using the same jsc for both the
expected and the result RDDs. With this solution the assertion succeeded as
expected.

    List<Row> df =
        spark.read().parquet("src/test/resources/test_data.parquet").toJavaRDD().collect();
    JavaRDD<Row> input = jsc().parallelize(df);

    JavaRDD<MyCustomer> result = MyDriver.convertToMyCustomerData(input);
    JavaRDDComparisons.assertRDDEquals(expected, result);
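
I was also wondering whether, instead of collecting the DataFrame to the
driver and re-parallelizing it, I could build the expected RDD from the
SparkSession's own context. This is only an untested sketch of what I have
in mind, assuming JavaSparkContext.fromSparkContext can be used to wrap the
session's underlying context in this setup:

    // Wrap the SparkSession's SparkContext so both RDDs come from the same
    // context, avoiding the collect() + parallelize() round trip.
    JavaSparkContext sharedJsc =
        JavaSparkContext.fromSparkContext(spark.sparkContext());

    JavaRDD<MyCustomer> expected = sharedJsc.parallelize(input_from_text_file);
    JavaRDD<Row> input =
        spark.read().parquet("src/test/resources/test_data.parquet").toJavaRDD();

    JavaRDD<MyCustomer> result = MyDriver.convertToMyCustomerData(input);
    JavaRDDComparisons.assertRDDEquals(expected, result);

Would that be a reasonable way to keep everything on one SparkContext?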


My questions are:
1. What is the best way to compare RDDs built from DataFrames against
expected RDDs obtained via jsc().parallelize()?
2. Is the above solution a suitable one?

I look forward to your answers.

Regards,
  Florin







On Wed, May 30, 2018 at 3:11 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> So Jessie has an excellent blog post on how to use it with Java
> applications -
> http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/
>
> On Wed, May 30, 2018 at 4:14 AM Spico Florin <spicoflo...@gmail.com>
> wrote:
>
>> Hello!
>>   I'm also looking for unit testing spark Java application. I've seen the
>> great work done in  spark-testing-base but it seemed to me that I could
>> not use for Spark Java applications.
>> Only spark scala applications are supported?
>> Thanks.
>> Regards,
>>  Florin
>>
>> On Wed, May 23, 2018 at 8:07 AM, umargeek <umarfarooq.tech...@gmail.com>
>> wrote:
>>
>>> Hi Steve,
>>>
>>> you can try out pytest-spark plugin if your writing programs using
>>> pyspark
>>> ,please find below link for reference.
>>>
>>> https://github.com/malexer/pytest-spark
>>> <https://github.com/malexer/pytest-spark>
>>>
>>> Thanks,
>>> Umar
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>> --
> Twitter: https://twitter.com/holdenkarau
>
