Chris,

first of all, it is possible to run several map tasks in one TaskTracker.
See (Hadoop code) TaskTracker.java, line 82; there is also a configuration value:
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

But this does not affect your problem, since each map task - even if several are running on the same task tracker - has its own partition of lexically divided URLs. Anyway, this statement may not answer the question, but maybe one of the core contributors can give a more detailed statement.
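For illustration only (the class and method names below are hypothetical, not the actual Nutch code), a host-based partitioner is one way to guarantee that two map tasks never share a host: every URL of a given host hashes to the same partition, so only one fetch list - and therefore one map task - ever contains it.

import java.net.MalformedURLException;
import java.net.URL;

public class HostPartitioner {

  /** Return the partition (0 .. numPartitions-1) a URL belongs to. */
  public int getPartition(String urlString, int numPartitions) {
    try {
      String host = new URL(urlString).getHost().toLowerCase();
      // The same host always hashes to the same partition, so all of its
      // URLs end up in the same fetch list.
      return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    } catch (MalformedURLException e) {
      return 0; // fall back to partition 0 for unparsable URLs
    }
  }
}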

Stefan



On 21.02.2006 at 05:07, Chris Schneider wrote:

Nutch Developers,

At 9:00pm +0100 1/15/06, Andrzej Bialecki wrote:
Also, I think the current implementation is not optimal, because it runs only a single map task for a fetcher. The reason for this is that it was the easiest way to ensure that we don't violate the politeness rules - if we ran multiple map tasks the methods blockAddr/unblockAddr in protocol-http couldn't prevent other map tasks from using the same address.
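To make the limitation concrete, here is a rough, hypothetical sketch - not the actual protocol-http source - of what a Nutch-0.7-style blockAddr/unblockAddr mechanism amounts to: the set of blocked addresses is an in-memory, per-JVM map, so fetcher tasks running in other JVMs never see it.

import java.net.InetAddress;
import java.util.HashMap;
import java.util.Map;

public class HostBlocker {

  // Addresses currently being fetched from by some thread in *this* JVM.
  // Tasks in other JVMs cannot see this map, hence the single-map-task rule.
  private static final Map<InetAddress, Long> BLOCKED =
      new HashMap<InetAddress, Long>();

  /** Wait until the address is free, then mark it as in use. */
  public static synchronized void blockAddr(InetAddress addr)
      throws InterruptedException {
    while (BLOCKED.containsKey(addr)) {
      HostBlocker.class.wait(); // woken up by unblockAddr()
    }
    BLOCKED.put(addr, System.currentTimeMillis());
  }

  /** Release the address so another thread in this JVM may fetch from it. */
  public static synchronized void unblockAddr(InetAddress addr) {
    BLOCKED.remove(addr);
    HostBlocker.class.notifyAll();
  }
}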

The proper solution is IMHO a central lock manager. I looked at the code, and it seems to me that JobTracker could manage this central lock manager (one per job? one per cluster? perhaps both?); this could be a part of a JobSubmissionProtocol - but I think there is currently no way for arbitrary code to reference its JobClient... bummer.
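For the sake of discussion, such a central lock manager might boil down to an interface like the one below (entirely hypothetical - nothing like this exists in the current code); fetcher map tasks in any JVM would call it over RPC before and after hitting a host.

// Purely hypothetical RPC interface for a cluster-wide host lock manager
// that the JobTracker (or a dedicated service) could expose.
public interface FetchLockProtocol {

  /**
   * Try to acquire an exclusive lock on a host for the given task.
   * @return true if granted, false if another task currently holds it
   */
  boolean tryLockHost(String jobId, String taskId, String host);

  /** Release a previously acquired host lock. */
  void unlockHost(String jobId, String taskId, String host);
}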

As I understand it, the current MapReduce implementation of fetching is restricted to running only one map task at a time on each TaskTracker. This is because IP blocking can't span multiple JVM instances, so there's no way to prevent two child processes on the same TaskTracker from hitting the same server simply through a Nutch-0.7-style blocking mechanism.

Assuming that the URLs have already been partitioned (and setting aside the merits of partitioning this way vs. by IP address - see my other email), wouldn't it be possible for the TaskTracker to avoid having two child processes hitting the same domain by ensuring that each was working on a separate domain?
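As a thought experiment (hypothetical names, not actual Hadoop code), that TaskTracker-level idea might amount to a guard like the one below: before launching a child fetch task, check that the domains in its partition are disjoint from those of the tasks already running on the node.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DomainGuard {

  // Domains owned by each child task currently running on this TaskTracker.
  private final Map<String, Set<String>> running =
      new HashMap<String, Set<String>>();

  /** Launch the task only if its domains are disjoint from all running tasks. */
  public synchronized boolean tryLaunch(String taskId, Set<String> taskDomains) {
    for (Set<String> owned : running.values()) {
      for (String domain : taskDomains) {
        if (owned.contains(domain)) {
          return false; // overlap: defer this task for now
        }
      }
    }
    running.put(taskId, new HashSet<String>(taskDomains));
    return true;
  }

  /** Called when a child task finishes, freeing its domains. */
  public synchronized void taskDone(String taskId) {
    running.remove(taskId);
  }
}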

Please forgive my vague understanding of the MapReduce implementation. I may also have misunderstood the gist of Andrzej's post (copied above).

Thanks,

- Chris

--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------


---------------------------------------------------------------
company:  http://www.media-style.com
forum:    http://www.text-mining.org
blog:     http://www.find23.net

