Hi,
I think the nutch-user mailing list is becoming a
very useful tool for people interested in running Nutch.
For that reason, I am asking a question whose answer I could not find,
neither in the tutorials (including Stefan's) nor in the mailing list
archives. It has to do with maintaining the crawled (and indexed) sites
once their expiry time comes; this is really not clear to me. For
testing purposes, I indexed about 3,000 sites with an expiry time of
just 1 day (I set this in the site.xml configuration file). After that
day I ran "bin/nutch generate db segments" with only the "-refetchonly"
option. When I then fetched the generated segment, I got about 30,000
sites. So I really don't know how to keep the sites that actually
matter for a specific searching purpose updated on a regular basis.
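Concretely, the setup looked roughly like this (the property name is
from memory and may not be exact, so treat it as an assumption):

    <!-- in conf/nutch-site.xml: refetch pages after 1 day
         (assumed property name, interval in days) -->
    <property>
      <name>db.default.fetch.interval</name>
      <value>1</value>
    </property>

    # after the day has passed, regenerate and refetch:
    bin/nutch generate db segments -refetchonly
    bin/nutch fetch segments/<newly-generated-segment>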
So, my question is: to keep the data current, is it necessary to
repeat the whole crawl process all over again, or is there some
quicker way that fetches only the pages modified or added since the
original crawl?
Thanks
