Jacob Brunson wrote:
> So the depth number is the number of iterations the recrawl script
> will go through. In each iteration, it will select a number of URLs
> from the crawl database (generate), download the pages at those URLs
> (fetch), and update the crawl database with the URLs that were fetched
> as well as any new URLs found (updatedb).
>
> If you want to redownload all your URLs in a single pass, you can set
> the depth to 1, the topN value to something around the number of pages
> you have in your database, and adddays to 31.
>
> The problem though is how do you keep it from adding in all the new
> URLs it finds during the crawl. You can either create nice regex
> filters of the pages indexed to prevent this, or you could try
> removing the updatedb command from the script and see what that does.
> Removal of the updatedb command will certainly prevent your crawl
> database from seeing any new URLs your fetch found, but it might also
> have other bad consequences.
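For concreteness, a single-pass recrawl along those lines might look like
the untested sketch below. The crawl/crawldb and crawl/segments paths, the
bin/nutch location and the topN value are assumptions; option names can
differ between Nutch releases, so check the usage output of each command.

    #!/bin/sh
    # Untested sketch of a single-pass recrawl, per the advice quoted above.
    # Assumed layout: crawldb in crawl/crawldb, segments in crawl/segments,
    # and the nutch script at bin/nutch. Adjust to your installation.

    CRAWLDB=crawl/crawldb
    SEGMENTS=crawl/segments

    # depth=1: a single generate/fetch/updatedb round.
    # -topN should be roughly the number of pages in your crawldb (50000 is
    # just a placeholder). -adddays 31 shifts the generator's notion of "now"
    # forward by a month, so pages not yet due under the default 30-day fetch
    # interval are still selected for refetching.
    bin/nutch generate $CRAWLDB $SEGMENTS -topN 50000 -adddays 31

    # generate creates a new timestamped segment; pick up the newest one.
    SEGMENT=`ls -d $SEGMENTS/* | sort | tail -1`

    bin/nutch fetch $SEGMENT

    # Either drop this step (with the caveats mentioned above), or keep it
    # and rely on your regex URL filters (regex-urlfilter.txt), as suggested
    # above, to keep unwanted new URLs out of the crawldb.
    bin/nutch updatedb $CRAWLDB $SEGMENT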
In the current trunk/ version updatedb supports an option -noAdditions.
If specified, only initially injected URLs will be refreshed, and no new
URLs will be added during updatedb operations.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
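With that option, the updatedb step from the sketch earlier in this thread
would become something like the following (again untested; the paths are
assumptions, and you should confirm via the "bin/nutch updatedb" usage
output that your checkout actually has -noAdditions):

    # Refresh entries already in the crawldb, but do not add newly
    # discovered URLs. $SEGMENT is the freshly fetched segment, as above.
    bin/nutch updatedb crawl/crawldb -noAdditions $SEGMENT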
