Make sure you set up enough reduce partitions so that no single partition is
overloaded. Another thing that may help is checking whether you’ve run out of
local disk space on the machines, and turning on spark.shuffle.consolidateFiles
to produce fewer shuffle files. Finally, there’s been a recent fix in both
branch 0.9 and master that reduces the amount of memory used when there are
small files (due to extra memory that was being taken by mmap()):
https://issues.apache.org/jira/browse/SPARK-1145. You can find it either in
the 1.0 release candidates on the dev list or in branch-0.9 in git.
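
Something like this would cover the first two points (just a sketch -- the app
name, input/output paths, parsing, and the 2000 reduce partitions are
placeholders you’d tune for your own data):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._   // pair RDD functions (pre-1.3 style)

  // Turn on shuffle file consolidation before creating the context
  val conf = new SparkConf()
    .setAppName("reduce-by-key-example")             // placeholder app name
    .set("spark.shuffle.consolidateFiles", "true")
  val sc = new SparkContext(conf)

  // Build (key, value) pairs; the path and parsing are illustrative only
  val pairs = sc.textFile("hdfs:///path/to/input")
    .map(line => (line.split("\t")(0), 1L))

  // Pass an explicit number of reduce partitions so no single partition
  // has to hold too much data; 2000 is just an example figure
  val counts = pairs.reduceByKey(_ + _, 2000)
  counts.saveAsTextFile("hdfs:///path/to/output")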

Matei

On May 17, 2014, at 5:45 PM, Madhu <ma...@madhu.com> wrote:

> Daniel,
> 
> How many partitions do you have?
> Are they more or less uniformly distributed?
> We have similar data volume currently running well on Hadoop MapReduce with
> roughly 30 nodes. 
> I was planning to test it with Spark. 
> I'm very interested in your findings. 
> 
> 
> 
> -----
> Madhu
> https://www.linkedin.com/in/msiddalingaiah
