Re: GroupBy and Spark Performance issue

2017-01-17 Thread Andy Dang
Repartitioning wouldn't save you from skewed data, unfortunately. The way Spark works now is that it pulls all the data for a given key into one single partition, and Spark, AFAIK, retains the mapping from key to data in memory. You can use aggregateByKey(), combineByKey(), or reduceByKey() to avoid this.
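The reason those three operators help is that they run a combiner on the map side, so each partition ships at most one pre-aggregated value per key across the shuffle instead of every raw record. A minimal plain-Python sketch of that idea (not PySpark; the function name and data are made up for illustration):

```python
from collections import defaultdict
from functools import reduce

def map_side_combine(partitions, combine):
    """Pre-aggregate within each partition before the shuffle,
    mimicking the combiner that reduceByKey/aggregateByKey run
    on the map side in Spark."""
    shuffled = defaultdict(list)
    for part in partitions:
        local = {}
        for key, value in part:
            # Fold values for the same key locally, before any "network" hop.
            local[key] = combine(local[key], value) if key in local else value
        # Only one value per key per partition crosses the shuffle boundary.
        for key, value in local.items():
            shuffled[key].append(value)
    # Final reduce per key on the "reducer" side.
    return {k: reduce(combine, vs) for k, vs in shuffled.items()}

partitions = [[("a", 3), ("a", 7), ("b", 2)],
              [("a", 5), ("b", 9), ("b", 1)]]
print(map_side_combine(partitions, max))  # {'a': 7, 'b': 9}
```

With groupByKey, by contrast, every raw (key, value) pair for a hot key lands on one reducer, which is where the skew hurts.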

GroupBy and Spark Performance issue

2017-01-16 Thread KhajaAsmath Mohammed
Hi, I am trying to group data in Spark and find the maximum value for each group. I have to use group by, as I need to transpose based on the values. I tried repartitioning the data by increasing the number from 1 to 1. The job runs until the stage below and then takes a long time to move ahead. I was
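For the max-per-group part of the goal, only a running maximum per key needs to be kept, which is why reduceByKey(max) avoids materializing whole groups the way groupByKey does. A plain-Python illustration of the target result (the keys and values here are hypothetical sample data):

```python
# Hypothetical sample records: (key, value) pairs.
data = [("dept_a", 10), ("dept_b", 4), ("dept_a", 25), ("dept_b", 7)]

maxes = {}
for key, value in data:
    # Keep only the running maximum per key -- constant memory per key,
    # which is the shape of work reduceByKey(max) streams through in Spark.
    maxes[key] = max(maxes.get(key, value), value)

print(maxes)  # {'dept_a': 25, 'dept_b': 7}
```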