Hello,

I have many questions about joins, but arguably just one.

Specifically, it's about memory and containers overstepping their limits. Errors like this are dotted all over the place, but this one is representative:
http://mail-archives.apache.org/mod_mbox/spark-issues/201405.mbox/%3CJIRA.12716648.1401112206043.18368.1401118322196@arcas%3E

I have a job (mostly like this: http://hastebin.com/quwamoreko.scala, but with a 
write-to-files-based-on-keys step at the end) that joins a medium-sized RDD 
(roughly 150,000 entries; classRDD in the link) to a larger one (108 million 
entries; objRDD in the link).
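
In case the hastebin link dies, here is a stripped-down sketch of the shape of 
the job. The classRDD/objRDD names are from the link; the key layout, the 
inline toy data, and the plain save standing in for my 
write-to-files-based-on-keys step are simplifications, not my actual code:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD implicits (join, etc.)

    object JoinSketch {
      def main(args: Array[String]): Unit = {
        // local master only so the sketch runs standalone
        val conf = new SparkConf().setAppName("join-sketch").setMaster("local[2]")
        val sc = new SparkContext(conf)

        // ~150,000 entries in the real job: (key, class info)
        val classRDD = sc.parallelize(Seq(("k1", "classA"), ("k2", "classB")))

        // ~108 million entries in the real job: (key, object info)
        val objRDD = sc.parallelize(Seq(("k1", "obj1"), ("k1", "obj2"), ("k2", "obj3")))

        // the join in question: RDD[(String, (String, String))]
        val joined = objRDD.join(classRDD)

        // stand-in for the real write-to-files-based-on-keys step
        joined.saveAsTextFile("/tmp/join-sketch-output")

        sc.stop()
      }
    }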

The keys and values for each entry are quite small. In the linked join, most 
objects have 10 or so classes and most classes have around 100k associated 
objects, though a few (10 or so?) classes have millions of objects and some 
objects have hundreds of classes.

The issue I'm having is that (on an m2.xlarge EC2 instance) my container is 
overstepping its memory limit and being shut down.

This confuses me and makes me question my fundamental understanding of joins.

I thought joins were a reduce operation that happened on disk. Further, my 
join doesn't seem to hold much in memory: at any given point, a pair of 
strings and another string is all I appear to be holding.
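
To be concrete about "not very much": each joined record, as far as I can 
tell, is just a key plus a pair of values, i.e. something shaped like this 
(illustrative values only):

    // the per-record shape after the join: three small strings in total
    val rec: (String, (String, String)) = ("key", ("objectInfo", "classInfo"))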

The container limit is 7 GB according to the error in my container logs, and 
that has apparently been reasonable for jobs I've run in the past. But again, 
I don't see where in my program I am actually keeping anything in memory at 
all. And yet, sure enough, after about 30 minutes of running, over a period of 
2 or so minutes, one of the containers grows from about 800 MB to 7 GB and is 
promptly killed.

So, my questions: what could be going on here, and how can I fix it? Is this 
just some fundamental feature of my data, or is there anything else I can do?

A further rider question: are there logger settings I can use to make the 
logs tell me exactly which point in my job has been reached, i.e. which RDD 
is being constructed or which join is being performed? The RDD numbers and 
stage IDs aren't all that helpful, and though I know the Spark UI exists, 
logs I can refer back to after my cluster has long since died would be great.
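
To make that question concrete, here is a sketch of the kind of thing I mean, 
assuming Spark routes its logging through log4j; picking 
org.apache.spark.scheduler.DAGScheduler as the class to turn up is just my 
guess at where stage/job progress gets logged:

    import org.apache.log4j.{Level, Logger}

    // turn up the scheduler's verbosity from driver code so stage and
    // job submissions land in logs I can keep after the cluster dies
    Logger.getLogger("org.apache.spark.scheduler.DAGScheduler").setLevel(Level.DEBUG)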

Cheers
- Sina
