Have you tried persisting sourceFrame with StorageLevel.MEMORY_AND_DISK?
Maybe you can also cache updatedRDD, since it gets used in the next two
lines.
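
A minimal sketch of what I mean, assuming sourceFrame is a DataFrame and
updatedRDD a plain RDD (names taken from your post, adjust to your types):

    import org.apache.spark.storage.StorageLevel

    // Materialize the frame once instead of recomputing it for every
    // downstream action; partitions spill to disk when memory is tight.
    sourceFrame.persist(StorageLevel.MEMORY_AND_DISK)

    // cache() is shorthand for persist(MEMORY_ONLY); fine if it fits in RAM.
    updatedRDD.cache()

Remember to unpersist() both once the reuse is over, so the executors can
reclaim the memory.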

Are you sure you are paying the performance penalty because of shuffling
alone? Yes, groupBy is a killer. How much time does your code spend in GC?
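
In case it helps, the Executors tab of the Spark UI shows GC time per
executor, and you can turn on detailed GC logging via the executor JVM
options, e.g. (a sketch, tune the flags to your JVM):

    import org.apache.spark.SparkConf

    // Print GC details to the executor stdout logs.
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")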

I can't tell whether your groupBy is actually unavoidable, but there are
times when the data is temporal and each operation needs just one element
before or after it; then zipWithIndex and a reduce-style pass can be used
to avoid the groupBy call, as in the sketch below.
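
For example, here is one way to pair each element with its successor using
zipWithIndex (a self-join on shifted indices rather than a literal reduce,
but the same idea; sortedRDD and the delta computation are made-up
stand-ins for your data):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("neighbors").setMaster("local[2]"))

    // Hypothetical input: values already sorted in temporal order.
    val sortedRDD = sc.parallelize(Seq(1.0, 3.0, 7.0, 8.0))

    // Key each element by its index, then re-key a copy with the index
    // shifted by one so every element meets its successor in the join.
    val indexed = sortedRDD.zipWithIndex().map { case (v, i) => (i, v) }
    val shifted = indexed.map { case (i, v) => (i - 1L, v) }
    val deltas  = indexed.join(shifted)
      .map { case (_, (cur, next)) => next - cur }

    deltas.collect()  // 2.0, 4.0, 1.0 (order not guaranteed)

The join still shuffles, but it only moves pairs of neighbouring rows
instead of collecting whole groups on one node, which is usually far less
data than a groupBy.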


..Manas


