Re: run reduceByKey on huge data in spark

2015-06-30 Thread barge.nilesh
I'm using 50 servers, 35 executors per server, 140GB memory per server

35 executors *per server* sounds kind of odd to me.

With 35 executors per server and 140GB per server, each executor only gets about 4GB, and that 4GB is further divided into the shuffle and storage memory fractions. Assuming the default storage memory fraction of 0.6, that works out to roughly 2.4GB of working space per executor, so if any single partition (key group) exceeds about 2.4GB there will be an OOM...

Maybe you can try with fewer executors per server/node...
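For illustration, here is a minimal sketch of how fewer, larger executors could be requested through SparkConf on YARN; the instance counts, memory sizes, and core counts below are made-up examples, not figures from this thread:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical sizing: 7 fat executors per server instead of 35 small ones,
    // so each executor gets roughly 20g of heap instead of ~4g.
    val conf = new SparkConf()
      .setAppName("WordCount")
      .set("spark.executor.instances", "350")     // 50 servers * 7 executors (example only)
      .set("spark.executor.memory", "20g")        // heap per executor (example only)
      .set("spark.executor.cores", "5")           // cores per executor (example only)
      .set("spark.storage.memoryFraction", "0.6") // Spark 1.x default
      .set("spark.shuffle.memoryFraction", "0.2") // Spark 1.x default
    val sc = new SparkContext(conf)

The same sizing can also be passed to spark-submit with --num-executors, --executor-memory and --executor-cores.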






run reduceByKey on huge data in spark

2015-06-30 Thread hotdog
I'm running reduceByKey in Spark. My program is the simplest Spark example:

val counts = textFile.flatMap(line => line.split(" "))
  .repartition(2)
  .map(word => (word, 1))
  .reduceByKey(_ + _, 1)
counts.saveAsTextFile("hdfs://...")

but it always runs out of memory...

I'm using 50 servers, 35 executors per server, 140GB memory per server.

The data volume is 8TB of documents: 20 billion documents and 1000 billion words in total, and there will be about 100 million distinct words after the reduce.

I wonder how I should configure Spark. What values should these parameters take?

1. the number of map partitions? 2, for example?
2. the number of reduce partitions? 1, for example?
3. other parameters?
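
For reference, the two numbers asked about above correspond to the repartition() argument and the second argument of reduceByKey; a minimal sketch with purely hypothetical partition counts:

    val counts = textFile
      .flatMap(line => line.split(" "))
      .repartition(10000)        // sets the number of "map" partitions (hypothetical value)
      .map(word => (word, 1))
      .reduceByKey(_ + _, 1000)  // second argument sets the number of "reduce" partitions (hypothetical value)
    counts.saveAsTextFile("hdfs://...")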



Re: run reduceByKey on huge data in spark

2015-06-30 Thread lisendong
Hello, I'm using Spark 1.4.2-SNAPSHOT and I'm running in YARN mode :-)

I wonder whether spark.shuffle.memoryFraction or spark.shuffle.manager would help, and how to set these parameters.
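
For reference, a minimal sketch of how those two settings could be passed through SparkConf; the values are illustrative only, and they can equally be given to spark-submit via --conf:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values only. In Spark 1.x, spark.shuffle.memoryFraction is the fraction of
    // heap usable for aggregation during shuffles, and spark.shuffle.manager selects the
    // shuffle implementation ("sort" is the default in 1.4).
    val conf = new SparkConf()
      .setAppName("WordCount")
      .set("spark.shuffle.memoryFraction", "0.4")
      .set("spark.shuffle.manager", "sort")
    val sc = new SparkContext(conf)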
 On Jul 1, 2015, at 1:32 AM, Ted Yu yuzhih...@gmail.com wrote:
 
 Which Spark release are you using?
 
 Are you running in standalone mode?
 
 Cheers
 



Re: run reduceByKey on huge data in spark

2015-06-30 Thread Ted Yu
Which Spark release are you using?

Are you running in standalone mode?

Cheers
