Re: GroupBy and Spark Performance issue

2017-01-17 Thread Andy Dang
Repartition wouldn't save you from skewed data, unfortunately. The way Spark works now is that it pulls all data for the same key into one single partition, and Spark, AFAIK, retains the mapping from key to values in memory. You can use aggregateByKey(), combineByKey(), or reduceByKey() to avoid this.
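The difference Andy describes can be sketched in plain Python (no Spark needed): a groupByKey-style aggregation materializes every value for a key in one place, while a reduceByKey/combineByKey-style aggregation keeps only one running value per key per partition and merges small partial maps afterwards. The data and function names below are illustrative, not Spark's actual implementation.

```python
from collections import defaultdict

def group_then_max(pairs):
    """groupByKey-style: collect the full value list per key, then reduce.
    A heavily skewed key forces all of its values into one task's memory."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)          # entire value list held in memory
    return {k: max(vs) for k, vs in groups.items()}

def combine_by_key_max(partitions):
    """reduceByKey/combineByKey-style: pre-aggregate inside each partition
    (Spark's map-side combine), then merge the small per-partition maps.
    Only one running value per key is ever held, so skew is far less painful."""
    partials = []
    for part in partitions:                # local combine, one pass per partition
        local = {}
        for key, value in part:
            local[key] = value if key not in local else max(local[key], value)
        partials.append(local)
    merged = {}                            # "shuffle": merge tiny partial maps
    for local in partials:
        for key, value in local.items():
            merged[key] = value if key not in merged else max(merged[key], value)
    return merged

parts = [[("a", 3), ("b", 1), ("a", 7)], [("a", 2), ("b", 9)]]
flat = [p for part in parts for p in part]
print(group_then_max(flat))        # {'a': 7, 'b': 9}
print(combine_by_key_max(parts))   # {'a': 7, 'b': 9}
```

Both produce the same per-key maxima; what changes is how much data a single task must hold, which is exactly why the shuffle-heavy groupBy stalls on skewed keys.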

GroupBy and Spark Performance issue

2017-01-16 Thread KhajaAsmath Mohammed
Hi, I am trying to group data in Spark and find the maximum value for each group. I have to use group by because I need to transpose based on the values. I tried to repartition the data by increasing the number from 1 to 1. The job runs until the stage below and then takes a long time to move ahead. I was
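The per-group maximum itself does not require a full groupBy; it can be expressed as a running reduction, with the transpose applied to the much smaller aggregated result afterwards. A minimal plain-Python sketch of that split (the `sensor`/`reading` column names are hypothetical, standing in for whatever the poster's data uses):

```python
def max_per_group(rows, key_col, val_col):
    """Running max per group -- one value per key, never the full group.
    In Spark this corresponds to reduceByKey(max) rather than groupByKey."""
    out = {}
    for row in rows:
        k, v = row[key_col], row[val_col]
        out[k] = v if k not in out else max(out[k], v)
    return out

def transpose(maxima):
    """Pivot the per-group maxima into one wide row (group -> column).
    This runs on the tiny aggregated result, not on the raw data."""
    return [dict(maxima)]

rows = [
    {"sensor": "s1", "reading": 10},
    {"sensor": "s2", "reading": 4},
    {"sensor": "s1", "reading": 25},
]
maxima = max_per_group(rows, "sensor", "reading")
print(maxima)             # {'s1': 25, 's2': 4}
print(transpose(maxima))  # [{'s1': 25, 's2': 4}]
```

Aggregating first and transposing second keeps the expensive step skew-tolerant, which is the shape of the fix suggested in the reply above.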

Re: Spark performance issue

2015-07-03 Thread Silvio Fiorito
Date: Friday, July 3, 2015 at 8:58 AM To: user@spark.apache.org Subject: Spark performance issue Hello guys, I'm after some advice on Spark performance. I have a MapReduce job that reads inputs, carries out a simple calculation, and writes the results into HDFS. I've

Spark performance issue

2015-07-03 Thread diplomatic Guru
Hello guys, I'm after some advice on Spark performance. I have a MapReduce job that reads inputs, carries out a simple calculation, and writes the results into HDFS. I've implemented the same logic in a Spark job. When I run both jobs on the same datasets, I'm getting different execution times, which is
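The job described above has a simple read-transform-write shape. A minimal stand-in for that pipeline, using local files and a placeholder calculation since the original message does not say what the computation or HDFS paths are:

```python
import os
import tempfile

def run_job(in_path, out_path):
    """Read input lines, apply a simple per-line calculation, write results.
    The real jobs read from and write to HDFS; the doubling below is a
    placeholder for whatever 'simple calculation' the poster's job performs."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            value = int(line.strip())
            dst.write(f"{value * 2}\n")    # placeholder calculation

tmp = tempfile.mkdtemp()
in_path = os.path.join(tmp, "in.txt")
out_path = os.path.join(tmp, "out.txt")
with open(in_path, "w") as f:
    f.write("1\n2\n3\n")
run_job(in_path, out_path)
print(open(out_path).read())  # 2, 4 and 6 on separate lines
```

For a job this simple, the timing gap between MapReduce and Spark usually comes from startup overhead and I/O rather than the calculation itself, which is worth ruling out before tuning anything else.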

Spark Performance issue

2014-07-15 Thread Malligarjunan S
Hello all, I am a newbie to Spark, just analyzing the product. I am facing a performance problem with Hive and am trying to analyse whether Spark will solve it or not, but it seems that Spark is also taking a lot of time. Let me know if I missed anything. shark> select count(time) from table2; OK 6050 Time