So the depth number is the number of iterations the recrawl script will go through. In each iteration it selects a batch of URLs from the crawl database (generate), downloads the pages at those URLs (fetch), and updates the crawl database with the status of the fetched URLs as well as any new URLs found in them (updatedb).
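In shell terms, each pass boils down to something like this (a
stripped-down sketch, not the wiki script verbatim; $crawl, $topn and
$adddays are placeholder variables, and the real script goes on to do
its invertlinks/index steps afterwards):

  depth=2
  for i in `seq 1 $depth`
  do
    # 1. generate: select up to $topn URLs due for refetch from the crawldb
    bin/nutch generate $crawl/crawldb $crawl/segments -topN $topn -adddays $adddays
    # the freshly generated segment is the newest directory under segments/
    segment=`ls -d $crawl/segments/* | tail -1`
    # 2. fetch: download the pages in that segment
    bin/nutch fetch $segment
    # 3. updatedb: write fetch status and newly discovered links back to the crawldb
    bin/nutch updatedb $crawl/crawldb $segment
  done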
If you want to redownload all your URLs in a single pass, you can set the depth to 1, the topN value to something around the number of pages you have in your database, and adddays to 31. The problem, though, is keeping it from adding all the new URLs it finds during the crawl. You can either write regex URL filters that only admit the pages you have already indexed, or you could try removing the updatedb command from the script and see what that does. Removing updatedb will certainly prevent your crawl database from picking up any new URLs the fetch found, but it might also have other bad consequences (for one thing, the db will never record that the pages were fetched, since updatedb is what writes fetch status back). There is a sketch of the one-pass variant at the bottom of this mail.

On 10/10/06, Chris Stephens <[EMAIL PROTECTED]> wrote:
> How does the depth option work on the 0.8 recrawl script that is on
> http://wiki.apache.org/nutch/IntranetRecrawl . I just want to re-index
> all of the pages currently in the db and not index any new pages these
> pages might link to. Should I use a 0 for this? It seems like the
> fetcher never runs when I do 0, and if I do anything above zero it
> starts indexing at a further depth than what is currently in my crawl
> db, which is further than I desire.
>
> -Chris Stephens

--
http://JacobBrunson.com
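For what it's worth, here is the one-pass variant described above as a
sketch. The topN number and the commented-out updatedb line are
illustrations of the idea, not something I have tested:

  # depth is effectively 1: no loop, a single generate/fetch pass
  topn=50000   # assumption: set this near the page count of your crawldb
  bin/nutch generate $crawl/crawldb $crawl/segments -topN $topn -adddays 31
  segment=`ls -d $crawl/segments/* | tail -1`
  bin/nutch fetch $segment
  # updatedb deliberately left out so new URLs never enter the crawldb;
  # untested, and it also means fetch status is never written back:
  # bin/nutch updatedb $crawl/crawldb $segment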
