I'm trying to understand two things about how Spark is working.

(1) When I try to cache an RDD that fits comfortably in memory (about 60 GB of data in about 600 GB of memory), I get seemingly random levels of caching, anywhere from around 60% to 100% of the RDD, with the same tuning parameters each time. What governs how much of an RDD gets cached when there is clearly enough memory?
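For context, the caching pattern in question is essentially the following sketch (the input path is a placeholder, not my actual job):

    import org.apache.spark.storage.StorageLevel

    // Cache a ~60 GB RDD entirely in memory (path is illustrative)
    val rdd = sc.textFile("hdfs:///path/to/data")
                .persist(StorageLevel.MEMORY_ONLY)
    rdd.count()  // force evaluation so the cache actually gets populated

    // Report how much of the RDD made it into the cache
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} " +
              s"partitions cached, ${info.memSize} bytes in memory")
    }

The numCachedPartitions count is where I see the 60-100% variation from run to run.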
(2) Even when the RDD is cached, running tasks over the data gives me varying locality levels. Sometimes it works perfectly, with everything PROCESS_LOCAL; other times 10-20% of the tasks run at locality ANY, and the job takes minutes instead of seconds. This often varies between two consecutive runs of the same task in the same shell. Is there anything I can do to affect this?

I tried caching with replication, but that made everything run out of memory almost instantly (with the same ~60 GB data set in 400-600 GB of memory); a sketch of what I tried is below my signature.

Thanks for the help,
    -Nathan

--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5
Phone: +1-416-203-3003 x 238
Email: nkronenf...@oculusinfo.com
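The replication sketch referenced above (I believe I was using the stock replicated storage level, MEMORY_ONLY_2; the path is again a placeholder):

    import org.apache.spark.storage.StorageLevel

    // Replicated caching: MEMORY_ONLY_2 keeps two in-memory copies of
    // every partition, roughly doubling the footprint -- which may be
    // why this blew through memory so quickly in my case.
    val replicated = sc.textFile("hdfs:///path/to/data")
                       .persist(StorageLevel.MEMORY_ONLY_2)
    replicated.count()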