Hello, I built a prototype that uses join and groupBy operations via the Spark RDD API. Recently I migrated it to the Dataset API, and now it runs much slower than the original RDD implementation. Did I do something wrong here, or is this the price I have to pay for the more convenient API? Is there a known way to deal with this effect (e.g. configuration via "spark.sql.shuffle.partitions" - but how would I determine the correct value)? My prototype uses Java Beans with a lot of attributes. Does this slow down Spark operations on Datasets?
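For reference, this is roughly where I would set that property; the value 16 below is only a placeholder for a small local test, not something I have validated:

    SparkSession spark = SparkSession.builder()
        .appName("JoinGroupByTest")
        .master("local[*]")
        // the default of 200 shuffle partitions seems high for a small local run;
        // 16 is just a guess, I don't know how to derive the "correct" value
        .config("spark.sql.shuffle.partitions", "16")
        .getOrCreate();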
Here is a simple example that shows the difference: JoinGroupByTest.zip <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27448/JoinGroupByTest.zip>

- I build two RDDs, join them, and group them. Afterwards I count and display the joined RDDs (method de.testrddds.JoinGroupByTest.joinAndGroupViaRDD()).
- When I do the same operations with Datasets, it takes approximately 40 times as long (method de.testrddds.JoinGroupByTest.joinAndGroupViaDatasets()).

A simplified sketch of the two code paths is at the end of this mail; the full code is in the zip.

Thank you very much for your help.
Matthias

PS1: This is a duplicate of http://apache-spark-user-list.1001560.n3.nabble.com/Are-join-groupBy-operations-with-wide-Java-Beans-using-Dataset-API-much-slower-than-using-RDD-API-tp27445.html because the sign-up confirmation process was not completed when that topic was posted.

PS2: See the appended screenshots taken from the Spark UI (jobs 0/1 belong to the RDD implementation, jobs 2/3 to the Dataset implementation):
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n27448/jobs.png>
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n27448/Job_RDD_Details.png>
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n27448/Job_Dataset_Details.png>
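PS3: For anyone who does not want to download the zip, below is a stripped-down sketch of the two code paths. MyBean and the sample data are placeholders for the much wider beans used in the actual test; this is not the exact code from the attached project.

    import java.io.Serializable;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import scala.Tuple2;

    public class JoinGroupBySketch {

        // placeholder bean; the beans in the real test have many more attributes
        public static class MyBean implements Serializable {
            private int id;
            private String value;
            public MyBean() {}
            public MyBean(int id, String value) { this.id = id; this.value = value; }
            public int getId() { return id; }
            public void setId(int id) { this.id = id; }
            public String getValue() { return value; }
            public void setValue(String value) { this.value = value; }
        }

        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("JoinGroupBySketch").master("local[*]").getOrCreate();
            JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

            List<MyBean> left = Arrays.asList(new MyBean(1, "a"), new MyBean(2, "b"));
            List<MyBean> right = Arrays.asList(new MyBean(1, "x"), new MyBean(2, "y"));

            // RDD flavour: key both sides by id, join, groupByKey, then count
            JavaPairRDD<Integer, MyBean> rddLeft =
                jsc.parallelize(left).mapToPair(b -> new Tuple2<>(b.getId(), b));
            JavaPairRDD<Integer, MyBean> rddRight =
                jsc.parallelize(right).mapToPair(b -> new Tuple2<>(b.getId(), b));
            long rddCount = rddLeft.join(rddRight).groupByKey().count();
            System.out.println("RDD count: " + rddCount);

            // Dataset flavour: join on the id column, group by the same column, count
            Dataset<MyBean> dsLeft = spark.createDataset(left, Encoders.bean(MyBean.class));
            Dataset<MyBean> dsRight = spark.createDataset(right, Encoders.bean(MyBean.class));
            Dataset<Row> joined = dsLeft.join(dsRight, dsLeft.col("id").equalTo(dsRight.col("id")));
            long dsCount = joined.groupBy(dsLeft.col("id")).count().count();
            System.out.println("Dataset count: " + dsCount);

            spark.stop();
        }
    }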