Re: Configuring Spark for reduceByKey on massive data sets
Hi Daniel,

Did you solve your problem? I hit the same issue when running reduceByKey on massive data on YARN.
Re: Configuring Spark for reduceByKey on massive data sets
Hi Matei,

Thanks for the suggestions. Is the number of partitions set by calling 'myrrd.partitionBy(new HashPartitioner(N))'? Is there a heuristic formula for choosing a good number of partitions?

Thanks,
Daniel

On Sat, May 17, 2014 at 8:33 PM, Matei Zaharia wrote:
> Make sure you set up enough reduce partitions so you don’t overload them.
> [...]
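For reference, reduceByKey also accepts the number of partitions directly, so an explicit partitionBy is not required. A minimal Scala sketch of both options; the paths, variable names, and the partition count are placeholders, not values from this thread:

    import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

    object PartitionSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partition-sketch"))

        // Hypothetical pair RDD of (key, count) items.
        val pairs = sc.textFile("hdfs:///path/to/input")   // placeholder path
          .map(line => (line, 1L))

        // Option 1: pass the number of reduce partitions to reduceByKey directly.
        val n = 2000                                        // placeholder; tune for your cluster
        val counts1 = pairs.reduceByKey(_ + _, n)

        // Option 2: pre-partition with a HashPartitioner; the subsequent
        // reduceByKey reuses that partitioner.
        val counts2 = pairs.partitionBy(new HashPartitioner(n)).reduceByKey(_ + _)

        counts1.saveAsTextFile("hdfs:///path/to/out1")      // placeholder path
        counts2.saveAsTextFile("hdfs:///path/to/out2")      // placeholder path
        sc.stop()
      }
    }

As a rough rule of thumb from the Spark tuning guide, 2-3 tasks per CPU core in the cluster is a common starting point; raising the partition count further shrinks the amount of data each reduce task has to handle at once.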
Re: Configuring Spark for reduceByKey on massive data sets
Hi,

Try using *reduceByKeyLocally*.

Regards,
Lukas Nalezenec

On Sun, May 18, 2014 at 3:33 AM, Matei Zaharia wrote:
> Make sure you set up enough reduce partitions so you don’t overload them.
> [...]
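Worth noting (not stated in the thread): reduceByKeyLocally merges all results into a scala.collection.Map on the driver, so it avoids shuffle files but only works when the number of distinct keys is small enough to fit in driver memory. A minimal sketch with placeholder data:

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalReduceSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("local-reduce-sketch"))

        // Placeholder data: a pair RDD with a small number of distinct keys.
        val pairs = sc.parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L)))

        // Returns a Map[String, Long] on the driver -- no shuffle output on disk,
        // but every distinct key must fit in driver memory.
        val counts: scala.collection.Map[String, Long] = pairs.reduceByKeyLocally(_ + _)

        counts.foreach { case (k, v) => println(s"$k -> $v") }
        sc.stop()
      }
    }

For a histogram with billions of input items but relatively few bins this could be a good fit; with a very large key space it would overload the driver rather than help.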
Re: Configuring Spark for reduceByKey on massive data sets
Make sure you set up enough reduce partitions so you don’t overload them. Another thing that may help is checking whether you’ve run out of local disk space on the machines, and turning on spark.shuffle.consolidateFiles to produce fewer files. Finally, there’s been a recent fix in both branch 0.9 and master that reduces the amount of memory used when there are small files (due to extra memory that was being taken by mmap()): https://issues.apache.org/jira/browse/SPARK-1145. You can find this in either the 1.0 release candidates on the dev list, or branch-0.9 in git.

Matei

On May 17, 2014, at 5:45 PM, Madhu wrote:
> Daniel,
>
> How many partitions do you have?
> Are they more or less uniformly distributed?
> [...]
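A sketch of how those two suggestions might be applied through SparkConf; the parallelism value is a placeholder, not a recommendation from this thread, and spark.shuffle.consolidateFiles is the 0.9/1.x-era setting Matei mentions:

    import org.apache.spark.{SparkConf, SparkContext}

    object ShuffleConfSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("shuffle-conf-sketch")
          // Consolidate shuffle outputs into fewer files per executor.
          .set("spark.shuffle.consolidateFiles", "true")
          // Default number of reduce partitions when none is passed explicitly;
          // placeholder value, tune for your cluster.
          .set("spark.default.parallelism", "2000")

        val sc = new SparkContext(conf)
        // ... build pair RDDs and call reduceByKey(_ + _) here ...
        sc.stop()
      }
    }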
Re: Configuring Spark for reduceByKey on massive data sets
Daniel,

How many partitions do you have? Are they more or less uniformly distributed? We have a similar data volume currently running well on Hadoop MapReduce with roughly 30 nodes. I was planning to test it with Spark. I'm very interested in your findings.

-
Madhu
https://www.linkedin.com/in/msiddalingaiah
Configuring Spark for reduceByKey on massive data sets
I have had a lot of success with Spark on large datasets, both in terms of performance and flexibility. However, I hit a wall with reduceByKey when the RDD contains billions of items. I am reducing with simple functions like addition for building histograms, so the reduction process should use constant memory. I am using tens of AWS EC2 machines with 60 GB of memory and 30 processors.

After a while the whole process just hangs. I have not been able to isolate the root problem from the logs, but I suspect the problem is in the shuffle. Simple mapping and filtering transformations work fine, and reduceByKey goes through if I first reduce the data down to about 10^8 items.

What do I need to do to make reduceByKey work for more than 10^9 items?

Thanks,
Daniel
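For context, a minimal Scala sketch of the kind of histogram-style reduction described above; the input path, binning function, and bin width are placeholders, not the original code:

    import org.apache.spark.{SparkConf, SparkContext}

    object HistogramSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("histogram-sketch"))

        // Placeholder input: one numeric value per line.
        val values = sc.textFile("hdfs:///path/to/values").map(_.toDouble)

        // Map each value to a histogram bin, then add up counts per bin.
        // The reduce function is plain addition, so each reduce task needs
        // only constant memory per key.
        val binWidth = 0.1                                 // placeholder bin width
        val histogram = values
          .map(v => (math.floor(v / binWidth).toLong, 1L))
          .reduceByKey(_ + _)

        histogram.saveAsTextFile("hdfs:///path/to/histogram")  // placeholder path
        sc.stop()
      }
    }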