Chris,

first of all, it is possible to run several map tasks in one TaskTracker.
See (Hadoop code) TaskTracker.java, line 82; there is also a configuration value:
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

But this does not affect your problem, since each map task - even if several are running on the same task tracker - has its own partition of lexically divided URLs. Anyway, this statement may not answer the question, but maybe one of the core contributors can give a more detailed statement.
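For illustration only (the class and method names below are hypothetical, not the actual Nutch code), a host-based partitioner is one way to guarantee that two map tasks never share a host: every URL of a given host hashes to the same partition, so only one fetch list - and therefore one map task - ever contains it.

import java.net.MalformedURLException;
import java.net.URL;

public class HostPartitioner {

  /** Return the partition (0 .. numPartitions-1) a URL belongs to. */
  public int getPartition(String urlString, int numPartitions) {
    try {
      String host = new URL(urlString).getHost().toLowerCase();
      // The same host always hashes to the same partition, so all of its
      // URLs end up in the same fetch list.
      return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    } catch (MalformedURLException e) {
      return 0; // fall back to partition 0 for unparsable URLs
    }
  }
}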

Stefan



On 21.02.2006 at 05:07, Chris Schneider wrote:

Nutch Developers,

At 9:00pm +0100 1/15/06, Andrzej Bialecki wrote:
Also, I think the current implementation is not optimal, because it runs only a single map task for a fetcher. The reason for this is that it was the easiest way to ensure that we don't violate the politeness rules - if we ran multiple map tasks the methods blockAddr/unblockAddr in protocol-http couldn't prevent other map tasks from using the same address.
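To make the limitation concrete, here is a rough, hypothetical sketch - not the actual protocol-http source - of what a Nutch-0.7-style blockAddr/unblockAddr mechanism amounts to: the set of blocked addresses is an in-memory, per-JVM map, so fetcher tasks running in other JVMs never see it.

import java.net.InetAddress;
import java.util.HashMap;
import java.util.Map;

public class HostBlocker {

  // Addresses currently being fetched from by some thread in *this* JVM.
  // Tasks in other JVMs cannot see this map, hence the single-map-task rule.
  private static final Map<InetAddress, Long> BLOCKED =
      new HashMap<InetAddress, Long>();

  /** Wait until the address is free, then mark it as in use. */
  public static synchronized void blockAddr(InetAddress addr)
      throws InterruptedException {
    while (BLOCKED.containsKey(addr)) {
      HostBlocker.class.wait(); // woken up by unblockAddr()
    }
    BLOCKED.put(addr, System.currentTimeMillis());
  }

  /** Release the address so another thread in this JVM may fetch from it. */
  public static synchronized void unblockAddr(InetAddress addr) {
    BLOCKED.remove(addr);
    HostBlocker.class.notifyAll();
  }
}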

The proper solution is IMHO a central lock manager. I looked at the code, and it seems to me that JobTracker could manage this central lock manager (one per job? one per cluster? perhaps both?); this could be a part of a JobSubmissionProtocol - but I think there is currently no way for arbitrary code to reference its JobClient... bummer.
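For the sake of discussion, such a central lock manager might boil down to an interface like the one below (entirely hypothetical - nothing like this exists in the current code); fetcher map tasks in any JVM would call it over RPC before and after hitting a host.

// Purely hypothetical RPC interface for a cluster-wide host lock manager
// that the JobTracker (or a dedicated service) could expose.
public interface FetchLockProtocol {

  /**
   * Try to acquire an exclusive lock on a host for the given task.
   * @return true if granted, false if another task currently holds it
   */
  boolean tryLockHost(String jobId, String taskId, String host);

  /** Release a previously acquired host lock. */
  void unlockHost(String jobId, String taskId, String host);
}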

As I understand it, the current MapReduce implementation of fetching is restricted to running only one map task at a time on each TaskTracker. This is because IP blocking can't span multiple JVM instances, so there's no way to prevent two child processes on the same TaskTracker from hitting the same server simply through a Nutch-0.7-style blocking mechanism.

Assuming that the URLs have already been partitioned (and setting aside the merits of partitioning this way vs. by IP address - see my other email), wouldn't it be possible for the TaskTracker to avoid having two child processes hitting the same domain by ensuring that each was working on a separate domain?
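As a thought experiment (hypothetical names, not actual Hadoop code), that TaskTracker-level idea might amount to a guard like the one below: before launching a child fetch task, check that the domains in its partition are disjoint from those of the tasks already running on the node.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class DomainGuard {

  // Domains owned by each child task currently running on this TaskTracker.
  private final Map<String, Set<String>> running =
      new HashMap<String, Set<String>>();

  /** Launch the task only if its domains are disjoint from all running tasks. */
  public synchronized boolean tryLaunch(String taskId, Set<String> taskDomains) {
    for (Set<String> owned : running.values()) {
      for (String domain : taskDomains) {
        if (owned.contains(domain)) {
          return false; // overlap: defer this task for now
        }
      }
    }
    running.put(taskId, new HashSet<String>(taskDomains));
    return true;
  }

  /** Called when a child task finishes, freeing its domains. */
  public synchronized void taskDone(String taskId) {
    running.remove(taskId);
  }
}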

Please forgive my vague understanding of the MapReduce implementation. I may also have misunderstood the gist of Andrzej's post (copied above).

Thanks,

- Chris

--
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------


---------------------------------------------------------------
company:  http://www.media-style.com
forum:    http://www.text-mining.org
blog:     http://www.find23.net

