I'm trying to understand two things about how Spark works.

(1) When I try to cache an RDD that fits comfortably in memory (about 60 GB of
data with roughly 600 GB of cluster memory), I get seemingly random levels of
caching, anywhere from around 60% to 100%, with the same tuning parameters.
What governs how much of an RDD gets cached when there is plenty of memory?
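For context, this is roughly how I'm caching the data and then checking how much
of it actually ended up in memory. The input path and RDD here are placeholders
rather than my real job; the inspection just mirrors what the Storage tab of the
web UI reports:

    import org.apache.spark.storage.StorageLevel

    // sc is the SparkContext provided by the spark-shell.
    // Placeholder input path, standing in for my real data set.
    val rdd = sc.textFile("hdfs:///data/example")
      .persist(StorageLevel.MEMORY_ONLY)   // plain cache(), no serialization or disk spill

    rdd.count()   // force the cache to materialize

    // How much of the RDD actually got cached (same numbers as the Storage tab).
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} " +
              s"partitions cached, ${info.memSize} bytes in memory")
    }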

(2) Even when the data is cached, running tasks over it yields varying locality
levels. Sometimes it works perfectly, with everything PROCESS_LOCAL; other times
10-20% of the tasks run at locality level ANY (and the job takes minutes instead
of seconds). This often changes between two consecutive runs of the same job in
the same shell. Is there anything I can do to influence this? I tried caching
with replication, but that made everything run out of memory almost instantly
(with the same 60 GB data set in 400-600 GB of memory). A sketch of what I've
been trying, or guessing at, follows below.
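Concretely, the knobs I've been poking at look roughly like this. The values and
names are illustrative, and spark.locality.wait is only my guess at the setting
relevant to the locality question:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("locality-experiment")      // placeholder app name
      // How long the scheduler waits for a PROCESS_LOCAL (or NODE_LOCAL) slot
      // before falling back toward ANY. The default is 3000 ms; 30000 is just
      // a guess at a value that might keep tasks local at some scheduling cost.
      .set("spark.locality.wait", "30000")

    val sc = new SparkContext(conf)

    // Replicated caching: each cached partition lives on two executors, which
    // doubles the memory footprint of the cached RDD. I assume that is why my
    // replicated runs ran out of memory so quickly.
    val rdd = sc.textFile("hdfs:///data/example")   // placeholder input path
      .persist(StorageLevel.MEMORY_ONLY_2)
    rdd.count()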

Thanks for the help,

                -Nathan


-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com
