Repartitioning won't save you from skewed data, unfortunately. The way Spark
works now, grouping pulls all records with the same key into a single
partition, and Spark, AFAIK, has to keep that key-to-values mapping in memory.

You can use aggregateByKey(), combineByKey(), or reduceByKey() to avoid this
problem, because these functions can be evaluated with map-side aggregation:
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
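
For example, here is a rough sketch of a per-key maximum with reduceByKey()
(spark-shell style; the "records" RDD and its sample values are just an
illustration, not taken from your job):

    // (key, value) pairs; replace with your parsed input
    val records = sc.parallelize(Seq(("a", 3), ("b", 7), ("a", 9), ("b", 2)))

    // groupByKey would shuffle every value for a key to one partition:
    //   records.groupByKey().mapValues(_.max)
    // reduceByKey computes a partial max on each mapper first, so only one
    // value per key per partition crosses the network.
    val maxPerKey = records.reduceByKey((a, b) => math.max(a, b))
    maxPerKey.collect().foreach(println)

Because the combining happens before the shuffle, skewed keys no longer force
all of their values onto a single executor at once.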


-------
Regards,
Andy

On Tue, Jan 17, 2017 at 5:39 AM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

> Hi,
>
> I am trying to group data in Spark and find the maximum value for each
> group. I have to use group by because I need to transpose based on the values.
>
> I tried repartitioning the data, increasing the number of partitions from 1
> to 10000. The job runs until the stage below and then takes a long time to
> move ahead. I was never successful; the job gets killed after some time with
> GC overhead limit errors.
>
>
> [image: Inline image 1]
>
> I increased the memory limits too. Not sure what is going wrong; can anyone
> guide me through the right approach?
>
> Thanks,
> Asmath
>
