Hello,

I have many questions about joins, but arguably just one.

Specifically, it's about memory and containers overstepping their limits. Errors like this are dotted all over the place, but this one is representative:
http://mail-archives.apache.org/mod_mbox/spark-issues/201405.mbox/%3CJIRA.12716648.1401112206043.18368.1401118322196@arcas%3E

I have a job (mostly like this: http://hastebin.com/quwamoreko.scala, but with a 
write-to-files-based-on-keys step at the end) that joins a medium-sized RDD 
(roughly 150,000 entries; classRDD in the link) to a larger one (108 million 
entries; objRDD in the link).
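
In case the hastebin link dies, here is a stripped-down sketch of the shape of 
the job. The classRDD/objRDD names are from the link; the key layout, the 
inline toy data, and the plain save standing in for my 
write-to-files-based-on-keys step are simplifications, not my actual code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD implicits (join, etc.)

    object JoinSketch {
      def main(args: Array[String]): Unit = {
        // local master only so the sketch runs standalone
        val conf = new SparkConf().setAppName("join-sketch").setMaster("local[2]")
        val sc = new SparkContext(conf)

        // ~150,000 entries in the real job: (key, class info)
        val classRDD = sc.parallelize(Seq(("k1", "classA"), ("k2", "classB")))

        // ~108 million entries in the real job: (key, object info)
        val objRDD = sc.parallelize(Seq(("k1", "obj1"), ("k1", "obj2"), ("k2", "obj3")))

        // the join in question: RDD[(String, (String, String))]
        val joined = objRDD.join(classRDD)

        // stand-in for the real write-to-files-based-on-keys step
        joined.saveAsTextFile("/tmp/join-sketch-output")

        sc.stop()
      }
    }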

The keys and values for each entry are quite small. In the linked join, most 
objects have 10 or so classes and most classes have around 100k associated 
objects, though a few (10 or so?) classes have millions of objects and some 
objects have hundreds of classes.

The issue I'm having is that (on an m2.xlarge EC2 instance) my container is 
overstepping its memory limit and being shut down.

This confuses me and makes me question my fundamental understanding of joins.

I thought joins were a reduce operation that happened on disk. Further, my 
join doesn't seem to hold much in memory: at any given point, a pair of 
strings and another string is all I appear to be holding.
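
To be concrete about "not very much": each joined record, as far as I can 
tell, is just a key plus a pair of values, i.e. something shaped like this 
(illustrative values only):

    // the per-record shape after the join: three small strings in total
    val rec: (String, (String, String)) = ("key", ("objectInfo", "classInfo"))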

The container limit is 7 GB according to the error in my container logs, and 
that has apparently been reasonable for jobs I've run in the past. But again, 
I don't see where in my program I am actually keeping anything in memory at 
all. And yet, sure enough, after about 30 minutes of running, over a period of 
2 or so minutes, one of the containers grows from about 800 MB to 7 GB and is 
promptly killed.

So, my questions: what could be going on here, and how can I fix it? Is this 
just some fundamental feature of my data, or is there anything else I can do?

A further rider question: are there logger settings I can use to make the 
logs tell me exactly which point in my job has been reached, i.e. which RDD 
is being constructed or which join is being performed? The RDD numbers and 
stage IDs aren't all that helpful, and though I know the Spark UI exists, 
logs I can refer back to after my cluster has long since died would be great.
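
To make that question concrete, here is a sketch of the kind of thing I mean, 
assuming Spark routes its logging through log4j; picking 
org.apache.spark.scheduler.DAGScheduler as the class to turn up is just my 
guess at where stage/job progress gets logged:

    import org.apache.log4j.{Level, Logger}

    // turn up the scheduler's verbosity from driver code so stage and
    // job submissions land in logs I can keep after the cluster dies
    Logger.getLogger("org.apache.spark.scheduler.DAGScheduler").setLevel(Level.DEBUG)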

Cheers
- Sina
