Re: run reduceByKey on huge data in spark
> I'm using 50 servers, 35 executors per server, 140GB memory per server.

35 executors *per server* sounds kind of odd to me. With 35 executors per server and 140GB per server, each executor is going to get only about 4GB, and that 4GB is further divided into the shuffle/storage memory fractions. Assuming the default storage memory fraction of 0.6, that leaves roughly 2.4GB of working space per executor, so if any partition (key group) exceeds 2.4GB you will hit an OOM. Maybe you can try fewer executors per server/node.
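For what it's worth, here is a rough sketch (not from the original thread) of how fewer, larger executors could be requested on a cluster like this; the node count and per-executor numbers below are only assumed examples, not tested values:

    import org.apache.spark.{SparkConf, SparkContext}

    // Assumed sizing: ~5 executors per 140GB node instead of 35,
    // so each executor gets roughly 24GB of heap plus overhead.
    val conf = new SparkConf()
      .setAppName("WordCount")
      .set("spark.executor.instances", "250") // 50 nodes * 5 executors (assumed)
      .set("spark.executor.memory", "24g")
      .set("spark.executor.cores", "5")
    val sc = new SparkContext(conf)

On YARN the same sizing can also be passed to spark-submit via --num-executors, --executor-memory and --executor-cores.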
run reduceByKey on huge data in spark
I'm running reduceByKey in Spark. My program is the simplest Spark example:

    val counts = textFile.flatMap(line => line.split(" "))
                         .repartition(2)
                         .map(word => (word, 1))
                         .reduceByKey(_ + _, 1)
    counts.saveAsTextFile("hdfs://...")

but it always runs out of memory.

I'm using 50 servers, 35 executors per server, 140GB memory per server. The data volume is 8TB of documents: 20 billion documents, 1000 billion words in total, and about 100 million distinct words after the reduce.

How should I configure Spark? What values should these parameters take?
1. the number of maps? 2, for example?
2. the number of reduces? 1, for example?
3. other parameters?
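As an aside, the repartition(2) and the numPartitions=1 argument to reduceByKey in the snippet above force all the data through very few tasks; a common variation is to give reduceByKey many more partitions. The sketch below is only an illustration with an assumed partition count, not a setting recommended in this thread:

    // Same word count, but with many reduce partitions instead of 1,
    // so no single task has to hold a huge key group in memory.
    val counts = textFile.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _, 2000) // 2000 partitions is an assumed example value
    counts.saveAsTextFile("hdfs://...")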
Re: run reduceByKey on huge data in spark
Hello,

I'm using Spark 1.4.2-SNAPSHOT and I'm running in YARN mode :-)

I wonder whether spark.shuffle.memoryFraction or spark.shuffle.manager would help, and how to set these parameters.

On July 1, 2015, at 1:32 AM, Ted Yu yuzhih...@gmail.com wrote:

Which Spark release are you using? Are you running in standalone mode?

Cheers
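For reference, both of those shuffle settings can be set on the SparkConf (or passed with --conf to spark-submit); the values below are placeholders for illustration, not recommendations from this thread:

    import org.apache.spark.SparkConf

    // Example only: raise the shuffle memory fraction from its 1.x default of 0.2
    // and use the sort-based shuffle manager (the default since Spark 1.2).
    val conf = new SparkConf()
      .set("spark.shuffle.memoryFraction", "0.4")
      .set("spark.shuffle.manager", "sort")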
Re: run reduceByKey on huge data in spark
Which Spark release are you using? Are you running in standalone mode?

Cheers

On Tue, Jun 30, 2015 at 10:03 AM, hotdog lisend...@163.com wrote:

I'm running reduceByKey in Spark. My program is the simplest Spark example ... but it always runs out of memory. I'm using 50 servers, 35 executors per server, 140GB memory per server. The data volume is 8TB of documents: 20 billion documents, 1000 billion words in total, and about 100 million distinct words after the reduce. How should I configure Spark?