Re: run reduceByKey on huge data in spark

lisendong Tue, 30 Jun 2015 10:37:45 -0700
Hello, I'm using Spark 1.4.2-SNAPSHOT.
I'm running in YARN mode :-)

I wonder whether setting spark.shuffle.memoryFraction or spark.shuffle.manager would help here.
How should I set these parameters?
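
For reference, here is a minimal sketch of one common way to set such properties in a standalone application, through SparkConf before the SparkContext is created. The application name and the 0.4 fraction are hypothetical placeholders, not recommended values; the same keys can also be passed to spark-submit as --conf key=value.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only (assumptions); tune them against the actual job.
val conf = new SparkConf()
  .setAppName("WordCount")                     // hypothetical application name
  .set("spark.shuffle.manager", "sort")        // "sort" or "hash" in Spark 1.4
  .set("spark.shuffle.memoryFraction", "0.4")  // fraction of heap for shuffle aggregation; default is 0.2

// The master (for example YARN) is assumed to be supplied by spark-submit.
val sc = new SparkContext(conf)
```

Properties set this way take effect only if they are applied before the SparkContext starts; changing them on a running context has no effect.
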
> On Jul 1, 2015, at 1:32 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> 
> Which Spark release are you using?
> 
> Are you running in standalone mode?
> 
> Cheers
> 
> On Tue, Jun 30, 2015 at 10:03 AM, hotdog <lisend...@163.com> wrote:
> I'm running reduceByKey in Spark. My program is the simplest Spark example:
> 
> val counts = textFile.flatMap(line => line.split(" "))
>                  .repartition(20000)
>                  .map(word => (word, 1))
>                  .reduceByKey(_ + _, 10000)
> counts.saveAsTextFile("hdfs://...")
> 
> but it always runs out of memory...
> 
> I'm using 50 servers, with 35 executors and 140 GB of memory per server.
> 
> The data volume is 8 TB: 20 billion documents and 1,000 billion words in
> total. After the reduce there will be about 100 million distinct words.
> 
> How should I configure Spark?
> 
> What values should these parameters have?
> 
> 1. The number of map tasks: 20000, for example?
> 2. The number of reduce tasks: 10000, for example?
> 3. Any other parameters?
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/run-reduceByKey-on-huge-data-in-spark-tp23546.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 