Are join/groupBy operations with wide Java Beans using Dataset API much slower than using RDD API?
Hello,

I built a prototype that uses join and groupBy operations via the Spark RDD API. Recently I migrated it to the Dataset API. Now it runs much slower than the original RDD implementation.

Did I do something wrong here? Or is this the price I have to pay for the more convenient API? Is there a known way to deal with this effect (e.g. configuration via "spark.sql.shuffle.partitions", but how could I determine the correct value)?

In my prototype I use Java Beans with a lot of attributes. Does this slow down Spark operations with Datasets?

Here is a simple example that shows the difference: JoinGroupByTest.zip <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27473/JoinGroupByTest.zip>

- I build 2 RDDs, then join and group them. Afterwards I count and display the joined RDDs. (Method de.testrddds.JoinGroupByTest.joinAndGroupViaRDD())
- When I do the same actions with Datasets, it takes approximately 40 times as long. (Method de.testrddds.JoinGroupByTest.joinAndGroupViaDatasets())

Thank you very much for your help.

Matthias

PS1: Excuse me for sending this post more than once. I am new to this mailing list and probably did something wrong when registering/subscribing, so my previous postings were not accepted.

PS2: See the appended screenshots taken from the Spark UI (jobs 0/1 belong to the RDD implementation, jobs 2/3 to the Dataset implementation):
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n27473/jobs.png>
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n27473/Job_RDD_Details.png>
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n27473/Job_Dataset_Details.png>

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Are-join-groupBy-operations-with-wide-Java-Beans-using-Dataset-API-much-slower-than-using-RDD-API-tp27473.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
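On the "spark.sql.shuffle.partitions" question: this setting controls how many partitions Spark SQL and the Dataset API use when shuffling data for joins and aggregations, and its default is 200. An RDD join without an explicit partition count instead derives its partitioning from the parent RDDs or from spark.default.parallelism. For a small local test, 200 very small tasks per shuffle stage can add noticeable scheduling overhead, so lowering the value is a common first experiment. Whether it explains the roughly 40x gap reported here is not confirmed, and it does not address the wide-bean part of the question. A minimal sketch, assuming a small dataset on a single machine; the value 8 is purely illustrative, not a recommendation:

```properties
# spark-defaults.conf (the same key can be passed with --conf on spark-submit)
# Assumption: 8 is an illustrative value, roughly the number of local cores.
# The default is 200.
spark.sql.shuffle.partitions   8
```

A reasonable way to pick a value is to try a few settings near the number of available cores and compare the stage timings in the Spark UI, as in the screenshots referenced above.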