I've a fairly large job (5E9 records, ~1600 partitions).wherein on a given stage, it looks like for the first half of the tasks, everything runs in process_local mode in ~10s/partition. Then, from halfway through, everything starts running in node_local mode, and takes 10x as long or more.
I read somewhere that the difference between the two had to do with the data being local to the running jvm, or another jvm on the same machine. If that's the case, shouldn't the distribution of the two modes be more random? If not, what exactly is the difference between the two modes? Given how much longer it takes in node_local mode, it seems like the whole thing would probably run much faster just by waiting for the right jvm to be free. Is there any way of forcing this? Thanks, -Nathan -- Nathan Kronenfeld Senior Visualization Developer Oculus Info Inc 2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5 Phone: +1-416-203-3003 x 238 Email: nkronenf...@oculusinfo.com