As a follow-up to this, the memory footprint of all shuffle metadata has
been greatly reduced. For your original workload with 7k mappers, 7k
reducers, and 5 machines, the total metadata size should have decreased
from ~3.3 GB to ~80 MB.
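
For a rough sense of where those numbers come from: shuffle metadata scales
with mappers x reducers, so 7k x 7k is ~49 million map-output blocks. The
per-block byte costs in the sketch below are only inferred from the figures
above, not taken from Spark's actual data structures:

    // Back-of-the-envelope estimate of shuffle metadata size. The per-block
    // byte costs are inferred from the ~3.3 GB / ~80 MB figures quoted above,
    // not from Spark internals.
    object ShuffleMetadataEstimate {
      def main(args: Array[String]): Unit = {
        val mappers  = 7000L
        val reducers = 7000L
        val blocks   = mappers * reducers    // ~49 million map-output blocks

        val bytesPerBlockBefore = 67.0       // ~3.3e9 bytes / 49e6 blocks (inferred)
        val bytesPerBlockAfter  = 1.6        // ~8.0e7 bytes / 49e6 blocks (inferred)

        println(f"before: ${blocks * bytesPerBlockBefore / 1e9}%.1f GB")
        println(f"after:  ${blocks * bytesPerBlockAfter / 1e6}%.0f MB")
      }
    }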


On Tue, Oct 29, 2013 at 9:07 AM, Aaron Davidson <ilike...@gmail.com> wrote:

> Great! Glad to hear it worked out. Spark definitely has a pain point about
> deciding the right number of partitions, and I think we're going to be
> spending a lot of time trying to reduce that issue.
>
> I'm currently working on the patch to reduce the shuffle file block
> overheads, but in the meantime you can set
> "spark.shuffle.consolidateFiles=false" to trade the OOMEs caused by too
> many partitions for worse performance (probably an acceptable tradeoff).
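>
> As a minimal sketch of how a spark.* setting like that was applied in Spark
> releases of this era, it was set as a Java system property before creating
> the SparkContext; the master URL and app name below are placeholders:
>
>     import org.apache.spark.SparkContext
>
>     object DisableConsolidation {
>       def main(args: Array[String]): Unit = {
>         // Set before constructing the SparkContext so it takes effect;
>         // treat the exact mechanism as version-dependent.
>         System.setProperty("spark.shuffle.consolidateFiles", "false")
>         val sc = new SparkContext("local[2]", "shuffle-config-sketch")
>         // ... job code ...
>         sc.stop()
>       }
>     }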
>
>
>
> On Mon, Oct 28, 2013 at 2:31 PM, Stephen Haberman <
> stephen.haber...@gmail.com> wrote:
>
>> Hey guys,
>>
>> As a follow-up, I raised our target partition size to 600 MB (up from
>> 64 MB), which split this report's 500 GB of tiny S3 files into ~700
>> partitions, and everything ran much more smoothly.
>>
>> In retrospect, this was the same issue we'd run into before, having too
>> many partitions, and had previously solved by throwing some guesses at
>> coalesce to make it magically go away.
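>>
>> A rough sketch of that sizing arithmetic, using the numbers from this
>> thread (the helper below is hypothetical, not a Spark API):
>>
>>     // Number of partitions is roughly total input size / target partition
>>     // size. Exact counts depend on input-split boundaries (the thread
>>     // reports ~700 partitions for ~500 GB at a 600 MB target).
>>     object PartitionSizing {
>>       def targetPartitionCount(totalBytes: Long, targetBytes: Long): Int =
>>         math.max(1, math.ceil(totalBytes.toDouble / targetBytes).toInt)
>>
>>       def main(args: Array[String]): Unit = {
>>         val totalInput = 500L * 1024 * 1024 * 1024  // ~500 GB of small S3 files
>>         val target     = 600L * 1024 * 1024         // 600 MB target partition size
>>         val n = targetPartitionCount(totalInput, target)
>>         println(s"coalesce to roughly $n partitions") // e.g. rdd.coalesce(n)
>>       }
>>     }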
>>
>> But now I feel like we have a much better understanding of why the
>> numbers need to be what they are, which is great.
>>
>> So, thanks for all the input and helping me understand what's going on.
>>
>> It'd be great to see some of the optimizations to BlockManager happen,
>> but I understand in the end why it needs to track what it does. And I
>> was also admittedly using a small cluster anyway.
>>
>> - Stephen
>>
>>
>
