Repartitioning wouldn't save you from skewed data, unfortunately. The way Spark
works now is that it pulls all the data for the same key onto one single partition,
and Spark, AFAIK, keeps that key-to-values mapping in memory.
You can use aggregateByKey(), combineByKey(), or reduceByKey() to avoid
this.
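For example, to get a maximum per key, reduceByKey() combines values map-side before the shuffle, so a skewed key no longer has to fit all of its values on one partition. A minimal sketch (the sample data and app name below are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object MaxPerKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MaxPerKey"))

    // Made-up (key, value) pairs standing in for the real input.
    val pairs = sc.parallelize(Seq(("a", 3), ("a", 9), ("b", 1), ("b", 7), ("a", 5)))

    // groupByKey() would ship every value for a key to one partition:
    //   pairs.groupByKey().mapValues(_.max)
    // reduceByKey() computes a partial max on each partition first, so only
    // one value per key per partition is shuffled.
    val maxPerKey = pairs.reduceByKey((a, b) => math.max(a, b))

    maxPerKey.collect().foreach(println)
    sc.stop()
  }
}

For aggregates that need a different accumulator type (say, a max together with a count), aggregateByKey() or combineByKey() give you the same map-side combining.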
Hi,
I am trying to group data in Spark and find the maximum value for each group
of data. I have to use group by because I need to transpose based on the values.
I tried repartitioning the data, increasing the number from 1 to 1. The job runs
until the stage below and then takes a long time to move ahead. I was
Date: Friday, July 3, 2015 at 8:58 AM
To: user@spark.apache.org
Subject: Spark performance issue
Hello guys,
I'm after some advice on Spark performance.
I have a MapReduce job that reads inputs, carries out a simple calculation, and
writes the results into HDFS. I've implemented the same logic in a Spark job.
When I tried both jobs on the same datasets, I got different execution
times, which is
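For reference, a Spark job with that shape usually looks something like the sketch below; the HDFS paths and the per-record calculation are placeholders rather than the actual job. When comparing it against the MapReduce version, it helps to give both jobs comparable resources and to check that the Spark job's partition count is in the same ballpark as the input split count.

import org.apache.spark.{SparkConf, SparkContext}

object SimpleCalc {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SimpleCalc"))

    // Placeholder HDFS paths; the real job would point at its own locations.
    val input = sc.textFile("hdfs:///data/input")

    // Placeholder per-record calculation: parse a CSV line and scale one field.
    val results = input
      .map(_.split(","))
      .filter(_.length > 1)
      .map(fields => (fields(0), fields(1).toDouble * 2.0))

    results.map { case (k, v) => s"$k,$v" }
      .saveAsTextFile("hdfs:///data/output")

    sc.stop()
  }
}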
Hello all,
I am a newbie to Spark, just analyzing the product. I am facing a
performance problem with Hive and am trying to analyse whether Spark will solve
it, but it seems that Spark is also taking a lot of time. Let me know if I am
missing anything.
shark> select count(time) from table2;
OK
6050
Time
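If table2 is registered in the Hive metastore, the same count can be run from a Spark job through HiveContext. A sketch, assuming hive-site.xml is on Spark's classpath (the table and column names are taken from the query above):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CountQuery {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CountQuery"))

    // HiveContext reads the existing Hive metastore, so tables created
    // through Hive (e.g. table2) can be queried without re-registering them.
    val hiveContext = new HiveContext(sc)

    hiveContext.sql("SELECT COUNT(time) FROM table2").show()

    sc.stop()
  }
}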