[Nutch-dev] Generating multiple fetchlists between updates

Andrzej Bialecki Fri, 13 Jan 2006 05:32:09 -0800

Hi,

In the 0.7 branch, whenever a segment was generated the WebDB was modified, so that the entries that ended up in the fetchlist wouldn't be immediately available to the next segment generation, if that happened before the WebDB was updated with the data from that first segment. This was achieved by adding 1 week to the next fetchTime on a Page.

I can't see that we do it in the trunk. This means that we cannot generate more than one fetchlist between the CrawlDB updates, because each fetchlist would be identical to the previous one... Should we worry about this? There is a cost to modify the CrawlDB, but there is also a cost to not be able to generate multiple different fetchlists and fetch them in parallel...
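
To make the difference concrete, here is a rough sketch of the two behaviours. It is a toy model only, not Nutch code: the Page class, the generate() function and the defer flag are hypothetical names invented for illustration, and the "DB" is just an in-memory list.

```python
from dataclasses import dataclass

# One week in milliseconds: the deferral the 0.7 branch applied to fetchTime.
WEEK_MS = 7 * 24 * 60 * 60 * 1000


@dataclass
class Page:
    # Hypothetical stand-in for a DB entry; not the real Nutch Page class.
    url: str
    fetch_time: int  # next time the page is due, in ms


def generate(db, now, top_n, defer=True):
    """Pick up to top_n pages that are due; optionally push them a week out."""
    due = [p for p in db if p.fetch_time <= now][:top_n]
    if defer:
        # 0.7-style behaviour: modify the DB so the next generate skips these.
        for p in due:
            p.fetch_time += WEEK_MS
    return [p.url for p in due]


def make_db():
    return [Page(f"http://example.com/{i}", 0) for i in range(4)]


# With deferral, two generations before any update yield disjoint fetchlists.
db = make_db()
first = generate(db, now=1000, top_n=2)
second = generate(db, now=1000, top_n=2)

# Without deferral (what the trunk appears to do), both lists are identical.
db_trunk = make_db()
same_a = generate(db_trunk, now=1000, top_n=2, defer=False)
same_b = generate(db_trunk, now=1000, top_n=2, defer=False)
```

In this model the cost of the first variant is the extra write pass over the selected entries; the cost of the second is that same_a and same_b cannot be fetched in parallel to any benefit.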


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




