I'm not an expert, but the groupByKey operation is well known to cause a lot of
shuffling, and the work done with groupByKey can usually be replaced by
reduceByKey.
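
In case it helps, here is a rough sketch of that swap at the RDD level - the
(customer_id, amount) pair RDD and the sum are just made-up placeholders for
illustration, not taken from your code:

import org.apache.spark.rdd.RDD

// Hypothetical pair RDD keyed by customer_id, used only for illustration.
val purchases: RDD[(String, Double)] = ???

// groupByKey ships every value for a key across the network before we sum them.
val totalsViaGroup = purchases.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side first, so far less data hits the shuffle.
val totalsViaReduce = purchases.reduceByKey(_ + _)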

Here is a great article on the groupByKey operation:

https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey


The whole repository - "spark-gotchas" - is filled with helpful tips.




Thank you,
Pushkar Gujar


On Wed, Apr 12, 2017 at 10:48 AM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

> Hi Steve,
>
> I have implemented repartitions on dataframe to 1. It helped the
> performance but not to a great extent. I am also looking for answers from
> the experts.
>
> Thanks,
> Asmath
>
> On Wed, Apr 12, 2017 at 9:45 AM, Steve Robinson <
> steve.robin...@aquilainsight.com> wrote:
>
>> Hi,
>>
>>
>> Does anyone have any optimisation tips or could propose an alternative
>> way to perform the below:
>>
>>
>> val groupedUserItems1 = userItems1.groupByKey{_.customer_id}
>> val groupedUserItems2 = userItems2.groupByKey{_.customer_id}
>> groupedUserItems1.cogroup(groupedUserItems2){
>>     case (_, userItems1, userItems2) =>
>>         processSingleUser(userItems1, userItems2)
>> }
>>
>> The userItems1 and userItems2 datasets are quite large (hundreds of millions
>> of records), so I'm finding the shuffle stage is shuffling gigabytes of data.
>>
>> Any help would be greatly appreciated.
>>
>> Thanks,
>>
>>
>> Steve Robinson
>>
>> steve.robin...@aquilainsight.com
>> 0131 290 2300
>>
>>
>> www.aquilainsight.com
>> linkedin.com/aquilainsight <https://www.linkedin.com/company/aquila-insight>
>>
>> twitter.com/aquilainsight
>>
>
>
