Hi Patrick/Aaron,

Sorry to revive this thread, but we're seeing some OOMEs again
(running with master a few commits after Aaron's optimizations). I
can tweak our job, but I just wanted to ask for some clarification.

> Another way to fix it is to modify your job to create fewer 
> partitions.

So, I get the impression that most people are not having OOMEs like we
are...why is that? Do we really have significantly more partitions than
most people use?

When Spark loads data directly from Hadoop/HDFS (which we don't do),
AFAIK the default partition size is 64MB (one partition per HDFS
block), which surely results in a large number of partitions (>10k?)
for many data sets?
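For example (just a sketch, going from memory; the path is made up and
I haven't double-checked the defaults):

    // Hypothetical: reading ~1TB from HDFS; Spark makes roughly one
    // partition per input split, i.e. per ~64MB HDFS block by default.
    val lines = sc.textFile("hdfs:///some/1tb/dataset")
    // For ~1TB of input that should already be on the order of 16k partitions.
    println(lines.partitions.length)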

When this job loads from S3, one month of data was originally 67,000
files (1 file per partition), but we have a routine that coalesces
it down to ~64MB partitions, which for this job meant 18,000 partitions.

18k partitions * 64MB/partition ≈ 1.1TB, which matches how much data
is in S3.
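Our auto-coalescing routine does roughly this (simplified sketch; the
real code derives the sizes from S3 listings, and the bucket/paths/
numbers here are made up):

    // Simplified sketch of the coalescing; values are illustrative.
    val targetPartitionBytes = 64L * 1024 * 1024                   // ~64MB per partition
    val totalBytes = 1100L * 1024 * 1024 * 1024                    // ~1.1TB in S3
    val numPartitions = (totalBytes / targetPartitionBytes).toInt  // ~17,600, i.e. ~18k

    val raw = sc.textFile("s3n://our-bucket/month-of-data/*")      // ~67,000 files
    val coalesced = raw.coalesce(numPartitions)                    // coalesce defaults to no shuffle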

How many partitions would usually be a good idea for 1TB of data?

I was generally under the impression, from the Sparrow slides and
similar material I had come across, that smaller partitions were
better for scheduling/retries/etc. anyway.

So, yeah, I can manually repartition this job down to a lower number
of partitions, or adjust our auto-coalescing to shoot for larger
partitions, but I was just hoping to get some feedback about what the
Spark team considers an acceptable number of partitions, both in
general and for a ~1TB RDD.
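Concretely, the kind of adjustment I have in mind (again just a sketch
with made-up numbers, building on the coalesced RDD above):

    // Shoot for ~512MB partitions instead of ~64MB: for ~1.1TB of data
    // that's roughly 2,200 partitions instead of 18,000.
    val biggerTargetBytes = 512L * 1024 * 1024
    val fewerPartitions = (totalBytes / biggerTargetBytes).toInt   // ~2,200
    val bigger = coalesced.coalesce(fewerPartitions)               // still no shuffle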

I'll send a separate email with the specifics of the OOME/heap dump.

Thanks!

- Stephen
