On Mar 19, 2007, at 10:51 AM, Philippe Gassmann wrote:

Doug Cutting wrote:
A simpler approach might be to develop an InputFormat that includes
multiple files per split.
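
Concretely, the heart of such an InputFormat is just the binning step: greedily pack input files into groups of roughly a target size, one group per split, so that one map task (and thus one JVM launch) covers many small files. The sketch below shows only that step; the InputFormat/RecordReader plumbing is omitted, and FileGrouper is a made-up name, not an existing Hadoop class.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;

public class FileGrouper {
  /** Greedily pack files into groups of roughly targetBytes each. */
  public static List<List<Path>> group(Path[] files, long[] lengths,
                                       long targetBytes) {
    List<List<Path>> groups = new ArrayList<List<Path>>();
    List<Path> current = new ArrayList<Path>();
    long bytes = 0;
    for (int i = 0; i < files.length; i++) {
      // Close the current bin once adding this file would overflow it.
      if (!current.isEmpty() && bytes + lengths[i] > targetBytes) {
        groups.add(current);
        current = new ArrayList<Path>();
        bytes = 0;
      }
      current.add(files[i]);
      bytes += lengths[i];
    }
    if (!current.isEmpty()) {
      groups.add(current);  // the last, possibly partial, bin
    }
    return groups;
  }
}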


Yes, but the issue remains if you have to run a large number of map
tasks to distribute the load across many machines. Launching a JVM is
costly; say it costs 1 second (I'm being optimistic). If you have to run
2,000 maps, that is 2,000 seconds lost just launching JVMs...

For task granularity, the most that makes sense is roughly 10-50 tasks per node. Given that a node runs at least 2 tasks at once, at 1 second per JVM launch that works out to 5-25 seconds of wall-clock time per node. It is noticeable, but it shouldn't be the dominant factor.

I already have a working patch against the 0.10.1 release of Hadoop that launches tasks inside the TaskTracker JVM if a specific parameter is set
in the JobConf of the submitted job (for jobs we trust ;) ).
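
The dispatch in such a patch presumably looks something like the following. This is a guess at the shape, not Philippe's actual patch; the flag name mapred.tasktracker.inprocess and the launcher class are made up.

import java.io.IOException;

import org.apache.hadoop.mapred.JobConf;

public class InProcessLauncher {
  /** Hypothetical flag name; the actual parameter in the patch isn't given. */
  public static final String IN_PROCESS = "mapred.tasktracker.inprocess";

  public void launch(JobConf job, Runnable work) throws IOException {
    if (job.getBoolean(IN_PROCESS, false)) {
      // Trusted job: run on a TaskTracker thread and skip the JVM launch.
      new Thread(work, "in-process-task").start();
    } else {
      spawnChildJvm(work);  // the normal fork-a-child path
    }
  }

  private void spawnChildJvm(Runnable work) throws IOException {
    // In real Hadoop this builds and exec's a java command line; stubbed here.
    throw new UnsupportedOperationException("sketch only");
  }
}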

Another possible direction would be to have the task JVM ask for another task before exiting. I believe that Ben Reed experimented with that and the changes were not too extensive. For security, you would want to limit JVM reuse to tasks from the same job.
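
A sketch of that reuse loop, with a hypothetical TaskSource standing in for whatever umbilical protocol the real change would use to fetch tasks from the TaskTracker:

public class ReusableChild {
  /** Hypothetical stand-in for the TaskTracker umbilical protocol. */
  public interface TaskSource {
    /** Next task for the given job, or null when none remain. */
    Runnable nextTask(String jobId);
  }

  public static void runLoop(TaskSource source, String jobId) {
    Runnable task;
    // Only ask for tasks from our own job, so user code never crosses
    // job boundaries within one JVM.
    while ((task = source.nextTask(jobId)) != null) {
      task.run();
    }
    // No more tasks for this job: fall through and let the JVM exit.
  }
}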

As a side note, we've already seen cases of client code that killed the task trackers. So it is hardly an abstract concern. *smile* (The client code managed to send kill signals to the entire process group, which included the task tracker. It was hard to debug and I'm not very interested in making it easier for client code to take out the servers.)

-- Owen
