Hi Matei,

Thanks for the suggestions. Is the number of partitions set by calling 'myRdd.partitionBy(new HashPartitioner(N))'? Is there a heuristic for choosing a good value of N?
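In case it helps to be concrete, here is a minimal sketch of the two ways I understand the partition count can be set (myRdd is a placeholder pair RDD of (key, value), and 200 is just an example value):

    import org.apache.spark.HashPartitioner

    // Explicitly repartition by key before the reduce:
    val partitioned = myRdd.partitionBy(new HashPartitioner(200))
    val counts = partitioned.reduceByKey(_ + _)

    // Or pass the partition count directly to the shuffle operation:
    val counts2 = myRdd.reduceByKey(_ + _, 200)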
thanks,
Daniel

On Sat, May 17, 2014 at 8:33 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Make sure you set up enough reduce partitions so you don’t overload them.
> Another thing that may help is checking whether you’ve run out of local
> disk space on the machines, and turning on spark.shuffle.consolidateFiles
> to produce fewer files. Finally, there’s been a recent fix in both branch
> 0.9 and master that reduces the amount of memory used when there are small
> files (due to extra memory that was being taken by mmap()):
> https://issues.apache.org/jira/browse/SPARK-1145. You can find this in
> either the 1.0 release candidates on the dev list, or branch-0.9 in git.
>
> Matei
>
> On May 17, 2014, at 5:45 PM, Madhu <ma...@madhu.com> wrote:
>
> > Daniel,
> >
> > How many partitions do you have?
> > Are they more or less uniformly distributed?
> > We have similar data volume currently running well on Hadoop MapReduce with
> > roughly 30 nodes.
> > I was planning to test it with Spark.
> > I'm very interested in your findings.
> >
> > -----
> > Madhu
> > https://www.linkedin.com/in/msiddalingaiah
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-Spark-for-reduceByKey-on-on-massive-data-sets-tp5966p5967.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
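P.S. To make sure I'm applying the spark.shuffle.consolidateFiles suggestion correctly, here is a minimal sketch of how I'd enable it (app name and master below are placeholders, not part of the original advice):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: enable shuffle file consolidation to produce fewer shuffle files.
    val conf = new SparkConf()
      .setAppName("reduceByKey-test")
      .setMaster("local[4]")
      .set("spark.shuffle.consolidateFiles", "true")
    val sc = new SparkContext(conf)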