GroupBy issue while running K-Means - Dataframe

Deepak Sharma Tue, 16 Jun 2020 00:30:15 -0700

Hi All,
I have a custom implementation of K-Means that needs the data to be
grouped by a key in a DataFrame.
The data is heavily skewed for some of the keys, and grouping those keys
fails with a BufferHolder error:
 Cannot grow BufferHolder by size 17112 because the size after growing
exceeds size limitation 2147483632


I tried solving it by converting the DataFrame to an RDD, using
reduceByKey on the RDD, and converting the result back to a DataFrame.
This fails with a Java heap out-of-memory error.
Since this looks like a common issue, I was wondering how others are
solving this problem.
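
For reference on the reduceByKey attempt: reduceByKey only keeps memory
bounded when the merged value has a fixed size. If the merge function
concatenates the rows of a key, the skewed keys still end up as one huge
value. Below is a minimal sketch, assuming the per-key step of the custom
K-Means only needs the mean of that key's points (an assumption, it is not
stated above). The names to_partial, merge, finish and local_reduce_by_key
are hypothetical; local_reduce_by_key is a plain-Python stand-in for
reduceByKey so the sketch runs without a cluster.

```python
def to_partial(point):
    """Turn one point into a fixed-size partial aggregate: (vector sum, count)."""
    return (list(point), 1)


def merge(a, b):
    """Combine two partial aggregates. Associative and commutative,
    so it is safe to apply in any order across partitions."""
    sum_a, n_a = a
    sum_b, n_b = b
    return ([x + y for x, y in zip(sum_a, sum_b)], n_a + n_b)


def finish(partial):
    """Convert a (vector sum, count) aggregate into the mean vector."""
    total, n = partial
    return [x / n for x in total]


def local_reduce_by_key(pairs, fn):
    """Plain-Python stand-in for reduceByKey (hypothetical helper, demo only)."""
    acc = {}
    for key, value in pairs:
        acc[key] = fn(acc[key], value) if key in acc else value
    return acc


# Toy (key, point) rows; real data would come from the DataFrame.
rows = [("a", (1.0, 2.0)), ("a", (3.0, 4.0)), ("b", (10.0, 0.0))]
partials = local_reduce_by_key(((k, to_partial(p)) for k, p in rows), merge)
centroids = {k: finish(v) for k, v in partials.items()}
```

On a PySpark RDD of (key, point) pairs the same functions would plug in as
rdd.mapValues(to_partial).reduceByKey(merge).mapValues(finish), so each key
carries one small (sum, count) value regardless of how many rows it has.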
-- 
Thanks
Deepak
