Hi,
I think the nutch-user mailing list is becoming a
very useful tool for people interested in running Nutch.
For that reason, I am asking a question whose answer I could not find,
neither in the tutorials (including Stefan's) nor in the mailing list
archives. It has to do with maintaining the crawled (and indexed) sites
once their expiry time comes; this is really not clear to me. For
testing purposes, I indexed about 3,000 sites with an expiry time of
just 1 day (I set this in the site.xml configuration file). After that
day I ran "bin/nutch generate db segments" with only the "-refetchonly"
option. When I then fetched the generated segment, I got about 30,000
sites. So I really don't know how to keep the sites that actually
matter for a specific searching purpose updated on a regular basis.
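Concretely, the setup looked roughly like this (the property name is
from memory and may not be exact, so treat it as an assumption):

    <!-- in conf/nutch-site.xml: refetch pages after 1 day
         (assumed property name, interval in days) -->
    <property>
      <name>db.default.fetch.interval</name>
      <value>1</value>
    </property>

    # after the day has passed, regenerate and refetch:
    bin/nutch generate db segments -refetchonly
    bin/nutch fetch segments/<newly-generated-segment>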
So, my question is: to keep the data current, is it necessary to
repeat the whole crawl process all over again, or is there some
quicker way that fetches only the pages modified or added since the
original crawl?
Thanks
