Spark's shuffling algorithm is very aggressive about storing everything in RAM,
and the behavior is worse in 1.6 with unified memory management
(UnifiedMemoryManager). At least in previous versions you could limit the
shuffle memory, but Spark 1.6 will use as much memory as it can get. What I see
is that Spark seems to underestimate the amount of memory its objects take up,
and therefore doesn't spill frequently enough. There's a dirty workaround
(legacy mode), but the common advice is to increase your parallelism (and keep
in mind that operations such as join pick a parallelism implicitly, so you'll
want to be explicit about it) -- see the sketch below.
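
For what it's worth, here's a rough sketch of both options in Scala. The
partition counts, app name, and master are placeholders you'd tune for your
own job; the commented-out legacy-mode settings are the stopgap, bumping the
partition count is usually the cleaner fix:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("explicit-parallelism-sketch")
      .setMaster("local[*]")                        // placeholder; on a cluster this comes from spark-submit
      .set("spark.default.parallelism", "2000")     // explicit default parallelism for RDD shuffles
      .set("spark.sql.shuffle.partitions", "2000")  // same idea for Spark SQL shuffles
      // The "dirty workaround": fall back to the pre-1.6 memory manager so the
      // shuffle memory fraction can be capped again.
      //.set("spark.memory.useLegacyMode", "true")
      //.set("spark.shuffle.memoryFraction", "0.2")

    val sc = new SparkContext(conf)

    val left  = sc.parallelize(Seq((1, "a"), (2, "b")))
    val right = sc.parallelize(Seq((1, "x"), (2, "y")))

    // join chooses a partition count implicitly unless you pass one explicitly:
    val joined = left.join(right, 2000)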

-------
Regards,
Andy

On Mon, Feb 22, 2016 at 2:12 PM, Alex Dzhagriev <dzh...@gmail.com> wrote:

> Hello all,
>
> I'm using Spark 1.6 and trying to cache a dataset which is 1.5 TB. I have
> only ~800 GB of RAM in total, so I'm choosing the DISK_ONLY storage level.
> Unfortunately, I'm exceeding the overhead memory limit:
>
>
> Container killed by YARN for exceeding memory limits. 27.0 GB of 27 GB 
> physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
>
>
> I'm giving 6 GB of overhead memory and using 10 cores per executor.
> Apparently, that's not enough. Without persisting the data, and instead
> computing the dataset later (twice in my case), the job works fine. Can
> anyone please explain what the overhead is that consumes this much memory
> when persisting to disk, and how can I estimate how much extra memory I
> should give the executors so the job doesn't fail?
>
> Thanks, Alex.
>
