Hi Matei,

Thanks for the suggestions. Is the number of partitions set by calling 'myRdd.partitionBy(new HashPartitioner(N))'? Is there a heuristic for choosing a good value of N?
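In case it helps to be concrete, here is a minimal sketch of the two ways I understand the partition count can be set (myRdd is a placeholder pair RDD of (key, value), and 200 is just an example value):

    import org.apache.spark.HashPartitioner

    // Explicitly repartition by key before the reduce:
    val partitioned = myRdd.partitionBy(new HashPartitioner(200))
    val counts = partitioned.reduceByKey(_ + _)

    // Or pass the partition count directly to the shuffle operation:
    val counts2 = myRdd.reduceByKey(_ + _, 200)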
thanks,
Daniel

On Sat, May 17, 2014 at 8:33 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Make sure you set up enough reduce partitions so you don’t overload them.
> Another thing that may help is checking whether you’ve run out of local
> disk space on the machines, and turning on spark.shuffle.consolidateFiles
> to produce fewer files. Finally, there’s been a recent fix in both branch
> 0.9 and master that reduces the amount of memory used when there are small
> files (due to extra memory that was being taken by mmap()):
> https://issues.apache.org/jira/browse/SPARK-1145. You can find this in
> either the 1.0 release candidates on the dev list, or branch-0.9 in git.
>
> Matei
>
> On May 17, 2014, at 5:45 PM, Madhu <ma...@madhu.com> wrote:
>
> > Daniel,
> >
> > How many partitions do you have?
> > Are they more or less uniformly distributed?
> > We have similar data volume currently running well on Hadoop MapReduce with
> > roughly 30 nodes.
> > I was planning to test it with Spark.
> > I'm very interested in your findings.
> >
> > -----
> > Madhu
> > https://www.linkedin.com/in/msiddalingaiah
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-Spark-for-reduceByKey-on-on-massive-data-sets-tp5966p5967.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
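P.S. To make sure I'm applying the spark.shuffle.consolidateFiles suggestion correctly, here is a minimal sketch of how I'd enable it (app name and master below are placeholders, not part of the original advice):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: enable shuffle file consolidation to produce fewer shuffle files.
    val conf = new SparkConf()
      .setAppName("reduceByKey-test")
      .setMaster("local[4]")
      .set("spark.shuffle.consolidateFiles", "true")
    val sc = new SparkContext(conf)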