I have a fairly large job (5E9 records, ~1600 partitions) in which, on a
given stage, the first half of the tasks all run in PROCESS_LOCAL mode at
~10s/partition.  Then, from about halfway through, everything starts
running in NODE_LOCAL mode and takes 10x as long or more.

I read somewhere that the difference between the two has to do with whether
the data is local to the running JVM or to another JVM on the same machine.
If that's the case, shouldn't the distribution of the two modes be more
random?  If not, what exactly is the difference between the two modes?
Given how much longer tasks take in NODE_LOCAL mode, it seems like the whole
job would probably run much faster just by waiting for the right JVM to
be free.  Is there any way of forcing this?
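
For reference, the closest knob I've found is the spark.locality.wait
setting, which as I understand it controls how long the scheduler waits for
a slot at a better locality level before falling back to the next one.  A
minimal sketch of what I'd try, assuming that setting really is what governs
the PROCESS_LOCAL to NODE_LOCAL fallback; the app name and the 30-second
value are placeholders I made up, not tuned numbers:

```scala
import org.apache.spark.SparkConf

object LocalityWaitSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: raising spark.locality.wait makes the scheduler hold tasks
    // longer for a PROCESS_LOCAL slot before accepting a NODE_LOCAL one.
    // "30s" is an arbitrary illustrative value; older Spark releases expect
    // plain milliseconds here (e.g. "30000") rather than a time string.
    val conf = new SparkConf()
      .setAppName("locality-wait-sketch") // hypothetical app name
      .set("spark.locality.wait", "30s")
    // There are also per-level variants (spark.locality.wait.process,
    // spark.locality.wait.node, spark.locality.wait.rack) that override
    // the general setting for a single locality level.

    // Print the resulting settings so they can be checked before launching.
    println(conf.toDebugString)
  }
}
```

I don't know whether this is the right approach, or whether a longer wait
just leaves executors idle.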


Thanks,
              -Nathan


-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com
