Jacob Brunson wrote:
> So the depth number is the number of iterations the recrawl script
> will go through.  In each iteration, it will select a number of URLs
> from the crawl database (generate), download the pages at those URLs
> (fetch), and update the crawl database with the URLs that were fetched
> as well as any new URLs found (updatedb).
>
> If you want to redownload all your URLs in a single pass, you can set
> the depth to 1, the topN value to something around the number of pages
> you have in your database, and adddays to 31.
>
> The problem, though, is how to keep it from adding all the new URLs
> it finds during the crawl.  You can either write regex URL filters
> that match only the pages already indexed, or you could try removing
> the updatedb command from the script and see what that does.
> Removing updatedb will certainly prevent your crawl database from
> seeing any new URLs the fetch found, but it may also have other bad
> consequences.
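
To make that concrete, such a single-pass recrawl looks roughly like
this (just a sketch -- the crawl/ paths, the topN value and the
0.8-style command layout are assumptions, adjust them to your
installation):

  #!/bin/sh
  # One generate -> fetch -> updatedb cycle, i.e. depth 1.
  CRAWLDB=crawl/crawldb        # assumed location of the crawl db
  SEGMENTS=crawl/segments      # assumed segments directory
  TOPN=50000                   # roughly the number of pages in your db
  ADDDAYS=31                   # consider every page due for refetching

  bin/nutch generate $CRAWLDB $SEGMENTS -topN $TOPN -adddays $ADDDAYS
  SEGMENT=`ls -d $SEGMENTS/* | tail -1`   # the segment just generated
  bin/nutch fetch $SEGMENT
  bin/nutch updatedb $CRAWLDB $SEGMENT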

In the current trunk/ version, updatedb supports a -noAdditions option. 
If it is specified, only the initially injected URLs will be refreshed, 
and no new URLs will be added during updatedb.
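
With that option the last line of the sketch above becomes, for example:

  bin/nutch updatedb $CRAWLDB $SEGMENT -noAdditions

so the existing entries still get their new fetch status, but nothing
new is added to the db.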

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


