A question about the fetch MapReduce process: is it possible that
some segments will turn out to be slower to fetch than others, so
that a few straggling tasks keep the whole job from finishing? The
problem will probably get worse with more fetch nodes, which is what
we're aiming for.

What about running one fetcher on each node 24/7? Each fetcher would
take segments from a global queue. The other parts of the system
would not have to wait until the to-fetch queue is depleted before
doing the db update and generating new segments. So basically, adding
a queue would allow pipelining of the time-consuming work, namely
fetching, db update, and segment generation, and we would not end up
waiting for one or two fetchers to finish their job.
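The queue-based design above can be sketched as a small producer-consumer
pipeline. This is only an illustration of the idea, not Nutch code: the
segment names, `run_fetchers`, and the two queues are all hypothetical, and
the actual fetch work is elided.

```python
import queue
import threading

def run_fetchers(segments, num_fetchers=3):
    """Sketch: each node runs one long-lived fetcher pulling segments from
    a shared queue, so a slow segment delays only its own fetcher rather
    than the whole round."""
    to_fetch = queue.Queue()
    fetched = queue.Queue()
    for seg in segments:
        to_fetch.put(seg)

    def fetcher(node_id):
        while True:
            try:
                seg = to_fetch.get_nowait()
            except queue.Empty:
                return  # queue drained; a real fetcher would block and wait
            # ... fetch the pages listed in `seg` here ...
            fetched.put(seg)  # hand off for db update / segment generation

    workers = [threading.Thread(target=fetcher, args=(i,))
               for i in range(num_fetchers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return [fetched.get() for _ in range(fetched.qsize())]
```

Because finished segments land on `fetched` as soon as they are done, a
downstream db-update stage could consume them incrementally instead of
waiting for the slowest fetcher, which is the pipelining being proposed.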

- Feng Zhou
Grad Student, CS, UC Berkeley

On Mon, 28 Mar 2005 11:36:47 -0800, Doug Cutting <[EMAIL PROTECTED]> wrote:
> A few weeks ago I drafted the attached document, discussing how
> MapReduce might be used in Nutch.  This is an incomplete, exploratory
> document, not a final design.  Most of Nutch's file formats are altered.
> Every operation is implemented with MapReduce.  To run things on a
> single machine we can automatically start a job tracker and one or more task
> trackers, all running in the same JVM.  Hopefully this will not be much
> slower than the current implementation running on a single machine.
> 
> Comments?
> 
> Doug


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
