Have you tried persisting sourceFrame with StorageLevel.MEMORY_AND_DISK? Maybe you can also cache updatedRDD, which gets used in the next two lines.
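To show why caching a reused dataset matters, here is a hypothetical plain-Python sketch (not Spark code): a call counter stands in for the cost of recomputing an RDD's lineage. In real Spark the calls would be sourceFrame.persist(StorageLevel.MEMORY_AND_DISK) or updatedRDD.cache().

```python
calls = {"count": 0}

def expensive_transform(xs):
    """Stand-in for an uncached RDD transformation: recomputed on every use."""
    calls["count"] += 1
    return [x * 2 for x in xs]

source = [1, 2, 3]

# Without caching: each downstream use recomputes the whole transform,
# just as an uncached RDD is recomputed for every action.
a = sum(expensive_transform(source))
b = max(expensive_transform(source))

# With caching: materialize once (cf. updatedRDD.cache()), reuse twice.
cached = expensive_transform(source)
c = sum(cached)
d = max(cached)

print(calls["count"], a, b, c, d)  # 3 12 6 12 6
```

The uncached path pays the transformation cost on every use; the cached path pays it once, which is the same trade-off persist/cache makes in Spark (with MEMORY_AND_DISK spilling to disk when the cached data does not fit in memory).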
Are you sure you are paying the performance penalty because of shuffling only? Yes, groupBy is a killer, but how much time does your code spend in GC? I can't tell whether your groupBy is actually unavoidable, but there are times when the data is temporal and an operation needs just one element before or after; in those cases zipWithIndex and reduce can be used to avoid the groupBy call.

..Manas

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-this-Spark-1-5-2-code-fast-and-shuffle-less-data-tp25671p25673.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
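The zipWithIndex-and-reduce idea above can be sketched outside Spark. This is a hypothetical plain-Python illustration, with enumerate standing in for RDD.zipWithIndex and a dict lookup standing in for the index join: each element is keyed by its position and paired with its predecessor, so per-pair results are computed without grouping anything.

```python
def deltas(readings):
    """Difference between each reading and the previous one, without a groupBy.

    Mimics the Spark shape: zipWithIndex -> key each value by its index ->
    join index i with index i - 1 -> reduce each joined pair to one result.
    """
    indexed = list(enumerate(readings))   # (index, value), like zipWithIndex
    by_index = dict(indexed)              # lookup stands in for the join
    return [value - by_index[i - 1] for i, value in indexed if i > 0]

print(deltas([1.0, 4.0, 9.0, 16.0]))  # [3.0, 5.0, 7.0]
```

In Spark the join keyed on (i, i - 1) shuffles only narrow (index, value) pairs, rather than collecting entire groups onto single executors the way groupByKey does, which is why this shape tends to shuffle less data.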