Chris,
first of all, it is possible to run several map tasks in one
TaskTracker. See TaskTracker.java in the Hadoop code, line 82; there
is also a configuration value:
<property>
<name>mapred.tasktracker.tasks.maximum</name>
<value>2</value>
<description>The maximum number of tasks that will be run
simultaneously by a task tracker.
</description>
</property>
But this does not affect your problem, since each map task - even if
several are running on the same task tracker - has its own partition
of lexically divided URLs.
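As a rough illustration of this kind of host-aware partitioning, here is a
hypothetical sketch (it uses a host hash for brevity rather than Nutch's
actual lexical URL ranges; the function name and signature are invented for
illustration, not Nutch code):

```python
import zlib
from urllib.parse import urlparse

def partition_by_host(url: str, num_partitions: int) -> int:
    """Hypothetical sketch: hash only the host part of the URL, so every
    URL from the same host lands in the same partition, and therefore in
    the same map task."""
    host = urlparse(url).hostname or ""
    # crc32 is used here only because it is deterministic across runs.
    return zlib.crc32(host.encode("utf-8")) % num_partitions
```

With a scheme like this, two URLs from the same host always end up in the
same partition, so only one map task ever fetches from that host.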
Anyway, this statement may not answer the question, but maybe one of
the core contributors can give a more detailed answer.
Stefan
On 21.02.2006, at 05:07, Chris Schneider wrote:
Nutch Developers,
At 9:00pm +0100 1/15/06, Andrzej Bialecki wrote:
Also, I think the current implementation is not optimal, because
it runs only a single map task for a fetcher. The reason for this
is that it was the easiest way to ensure that we don't violate the
politeness rules - if we ran multiple map tasks the methods
blockAddr/unblockAddr in protocol-http couldn't prevent other map
tasks from using the same address.
The proper solution is IMHO a central lock manager. I looked at
the code, it seems to me that JobTracker could manage this central
lock manager (one per job? one per cluster? perhaps both?), this
could be a part of a JobSubmissionProtocol - but I think there is
no way now for the arbitrary code to reference its JobClient...
bummer.
As I understand it, the current MapReduce implementation of
fetching is restricted to running only one map task at a time on
each TaskTracker. This is because IP blocking can't span multiple
JVM instances, so there's no way to prevent two child processes on
the same TaskTracker from hitting the same server simply through a
Nutch-0.7-style blocking mechanism.
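To make the limitation concrete, here is a hypothetical sketch of
Nutch-0.7-style per-address blocking (the real blockAddr/unblockAddr in
protocol-http are Java and differ in detail; the names below just mirror
them for illustration). The blocked set lives inside a single process,
which is exactly why two fetcher children in separate JVMs cannot see each
other's claims:

```python
import threading

# Per-process state: each JVM (or here, each Python process) would have
# its own copy, so this cannot coordinate fetchers across processes.
_blocked = set()
_lock = threading.Lock()

def block_addr(addr: str) -> bool:
    """Try to claim an address; returns False if it is already claimed
    by another thread in THIS process."""
    with _lock:
        if addr in _blocked:
            return False
        _blocked.add(addr)
        return True

def unblock_addr(addr: str) -> None:
    """Release a previously claimed address."""
    with _lock:
        _blocked.discard(addr)
```

Two threads in one process are serialized correctly by this, but a second
process starts with an empty set and would happily hit the same server.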
Assuming that the URLs have already been partitioned (and setting
aside the merits of partitioning this way vs. by IP address - see
my other email), wouldn't it be possible for the TaskTracker to
avoid having two child processes hitting the same domain by
ensuring that each was working on a separate domain?
Please forgive my vague understanding of the MapReduce
implementation. I may also have misunderstood the gist of Andrzej's
post (copied above).
Thanks,
- Chris
--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net