So the depth number is the number of iterations the recrawl script
goes through.  In each iteration it selects a batch of URLs from the
crawl database (generate), downloads the pages at those URLs (fetch),
and updates the crawl database with the status of the fetched URLs as
well as any new URLs it discovered (updatedb).
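
For reference, one iteration of that loop boils down to roughly the
following (a sketch of the 0.8 command sequence, not the wiki script
verbatim; the crawl/crawldb and crawl/segments paths and the topN
value are just assumptions):

  # select up to topN URLs that are due for fetching into a new segment
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -adddays 31
  segment=`ls -d crawl/segments/* | tail -1`
  # download the pages listed in that segment
  bin/nutch fetch $segment
  # fold fetch results and any newly discovered URLs back into the db
  bin/nutch updatedb crawl/crawldb $segment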

If you want to redownload all your URLs in a single pass, set the
depth to 1, set topN to at least the number of pages in your database,
and set adddays to 31 (with the default 30-day fetch interval, that
makes every page in the db look due for refetching).
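
In terms of the underlying generate call, that single pass works out
to something like this (50000 here is just a stand-in for your actual
page count):

  bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -adddays 31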

The problem, though, is keeping it from adding all the new URLs it
finds during the crawl.  You can either write regex URL filters that
restrict the crawl to the pages already in your index, or you can try
removing the updatedb command from the script and see what happens.
Removing updatedb will certainly keep your crawl database from ever
seeing the new URLs your fetch found, but it may have other unwanted
consequences, since updatedb is also what records the fetch status
and next-fetch time of the pages you just downloaded.
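
If you go the filter route, the patterns live in
conf/regex-urlfilter.txt, one per line, applied top to bottom with the
first match winning.  The host below is just an example:

  # keep only URLs under the site already in your db
  +^http://www\.example\.com/
  # reject everything else
  -.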

On 10/10/06, Chris Stephens <[EMAIL PROTECTED]> wrote:
> How does the depth option work on the 0.8 recrawl script that is on
> http://wiki.apache.org/nutch/IntranetRecrawl .  I just want to re-index
> all of the pages currently in the db and not index any new pages these
> pages might link to.  Should I use a 0 for this?  It seems like the
> fetcher never runs when I do 0, and if I do anything above zero it
> starts indexing at a further depth than what is currently in my crawl
> db, which is further than I desire.
>
> -Chris Stephens
>


-- 
http://JacobBrunson.com
