Not an expert, but the groupByKey operation is well known to cause a lot of shuffling, and the work done with groupByKey can usually be replaced by reduceByKey.
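For example, here is a minimal sketch of the difference (the pair RDD, keys, and values are made up for illustration, not taken from your data):

    import org.apache.spark.sql.SparkSession

    object ShuffleDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("shuffle-demo")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        // Hypothetical (customer_id, amount) pairs standing in for the real data.
        val pairs = sc.parallelize(Seq(("c1", 1.0), ("c2", 2.0), ("c1", 3.0)))

        // groupByKey ships every individual value across the network,
        // and the sum only happens after the shuffle.
        val viaGroupByKey = pairs.groupByKey().mapValues(_.sum)

        // reduceByKey merges values within each partition first (map-side combine),
        // so only one partial sum per key per partition gets shuffled.
        val viaReduceByKey = pairs.reduceByKey(_ + _)

        viaGroupByKey.collect().foreach(println)
        viaReduceByKey.collect().foreach(println)

        spark.stop()
      }
    }

Both produce the same result, but reduceByKey moves far less data over the wire when there are many values per key.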
Here is a great article on the groupByKey operation -
https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey
The whole "spark-gotchas" repository is filled with a lot of helpful tips.

Thank you,
*Pushkar Gujar*

On Wed, Apr 12, 2017 at 10:48 AM, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> wrote:

> Hi Steve,
>
> I have implemented repartition on the dataframe to 1. It helped the
> performance but not to a great extent. I am also looking for answers from
> the experts.
>
> Thanks,
> Asmath
>
> On Wed, Apr 12, 2017 at 9:45 AM, Steve Robinson <steve.robin...@aquilainsight.com> wrote:
>
>> Hi,
>>
>> Does anyone have any optimisation tips or could propose an alternative
>> way to perform the below:
>>
>> val groupedUserItems1 = userItems1.groupByKey{_.customer_id}
>> val groupedUserItems2 = userItems2.groupByKey{_.customer_id}
>> groupedUserItems1.cogroup(groupedUserItems2){
>>   case (_, userItems1, userItems2) =>
>>     processSingleUser(userItems1, userItems2)
>> }
>>
>> The userItems1 and userItems2 datasets are quite large (hundreds of
>> millions of records), so I'm finding the shuffle stage is shuffling
>> gigabytes of data.
>>
>> Any help would be greatly appreciated.
>>
>> Thanks,
>>
>> Steve Robinson
>> steve.robin...@aquilainsight.com
>> 0131 290 2300
>>
>> www.aquilainsight.com
>> linkedin.com/aquilainsight <https://www.linkedin.com/company/aquila-insight>
>> twitter.com/aquilainsight