As a follow-up on this, the memory footprint of all shuffle metadata has been greatly reduced. For your original workload with 7k mappers, 7k reducers, and 5 machines, the total metadata size should have decreased from ~3.3 GB to ~80 MB.
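For anyone following along, here is a rough back-of-envelope reading of those numbers. The per-block byte costs below are my own assumptions to make the arithmetic line up, not figures from the actual patch:

    // 7k mappers x 7k reducers means one shuffle block per (map, reduce) pair.
    object ShuffleMetadataEstimate {
      def main(args: Array[String]): Unit = {
        val mappers  = 7000L
        val reducers = 7000L
        val blocks   = mappers * reducers            // 49,000,000 blocks

        // Assumed ~70 bytes of bookkeeping per block (object headers,
        // offsets, sizes) gives roughly the ~3.3 GB figure above.
        val perBlockBytes = 70L
        println(f"naive:     ${blocks * perBlockBytes / 1e9}%.1f GB")

        // Collapsing that to on the order of a couple of bytes per block
        // (e.g. a compact size array per map output) lands near ~80 MB.
        val compactBytes = 1.6
        println(f"compacted: ${blocks * compactBytes / 1e6}%.0f MB")
      }
    }

The point is simply that per-block metadata cost scales with mappers * reducers, so shaving tens of bytes per block is worth gigabytes at this scale.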
On Tue, Oct 29, 2013 at 9:07 AM, Aaron Davidson <ilike...@gmail.com> wrote:
> Great! Glad to hear it worked out. Spark definitely has a pain point about
> deciding the right number of partitions, and I think we're going to be
> spending a lot of time trying to reduce that issue.
>
> Currently working on the patch to reduce the shuffle file block overheads,
> but in the meantime, you can set "spark.shuffle.consolidateFiles=false" to
> exchange OOMEs due to too many partitions for worse performance (probably
> an acceptable tradeoff).
>
>
> On Mon, Oct 28, 2013 at 2:31 PM, Stephen Haberman <
> stephen.haber...@gmail.com> wrote:
>
>> Hey guys,
>>
>> As a follow-up, I raised our target partition size to 600mb (up from
>> 64mb), which split this report's 500gb of tiny S3 files into ~700
>> partitions, and everything ran much smoother.
>>
>> In retrospect, this was the same issue we'd run into before, having too
>> many partitions, and had previously solved by throwing some guesses at
>> coalesce to make it magically go away.
>>
>> But now I feel like we have a much better understanding of why the
>> numbers need to be what they are, which is great.
>>
>> So, thanks for all the input and helping me understand what's going on.
>>
>> It'd be great to see some of the optimizations to BlockManager happen,
>> but I understand in the end why it needs to track what it does. And I
>> was also admittedly using a small cluster anyway.
>>
>> - Stephen
>>
>
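For completeness, here is a minimal sketch of the two remedies discussed in the quoted thread, written against the 0.8-era API. The S3 path, local master, and exact partition math are placeholders, not values from the thread:

    import org.apache.spark.SparkContext

    object CoalesceSketch {
      def main(args: Array[String]): Unit = {
        // The consolidation flag mentioned above; in 0.8 it is set as a
        // Java system property before the context is created.
        System.setProperty("spark.shuffle.consolidateFiles", "false")

        val sc = new SparkContext("local[4]", "coalesce-sketch")

        // Many tiny S3 objects produce one partition each; coalescing to a
        // target of ~600 MB per partition keeps the partition count (and
        // hence the shuffle block count) manageable.
        val inputBytes      = 500L * 1024 * 1024 * 1024   // ~500 GB of input
        val targetPartBytes = 600L * 1024 * 1024          // ~600 MB / partition
        val numPartitions   = (inputBytes / targetPartBytes).toInt

        val records = sc.textFile("s3n://bucket/path/to/tiny-files/*")
          .coalesce(numPartitions)

        println(s"coalescing to $numPartitions partitions")
        println(records.count())
        sc.stop()
      }
    }

The thread arrived at ~700 partitions for a similar target size; the exact count depends on how the input bytes are measured, but the idea is the same: pick the partition size first, then derive the partition count from it.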