I have a Nutch crawldb into which new seed URLs are injected quite regularly. Once new URLs are injected, I would like to do a crawl run that fetches these newly injected URLs only (i.e. URLs that were already present in the crawldb before these new URLs were injected should not be fetched). Is there any way to accomplish this?
One possibility I can think of is to use the FreeGenerator to generate a fetchlist directly from the new seed URLs instead of ever injecting them into the crawldb; the updatedb step following the crawl would put these URLs into the crawldb anyway. But this would cause URLs that have already been fetched earlier (as recorded in the crawldb) to be refetched. Another option could be to use a new crawldb for each batch of new seed URLs, do the fetch/updatedb against it, and then merge this new crawldb with the main crawldb. Apart from the additional overhead of merging the crawldbs, this approach, like the one above, would also cause already-fetched URLs to be refetched. Is there a better approach? Thanks, Siddhartha
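For reference, the two workarounds above would look roughly like the following with the Nutch 1.x command line. This is only a sketch: the directory names (new_seeds/, crawl/crawldb, etc.) are placeholders, and I haven't verified these exact invocations against my setup.

```shell
#!/bin/sh
# Workaround 1: bypass inject and build a fetchlist straight from the seed file.
# freegen does not consult the crawldb, so any seed URL that was fetched
# before would be fetched again -- this is the drawback described above.
bin/nutch freegen new_seeds/ crawl/segments
SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
bin/nutch updatedb crawl/crawldb "$SEGMENT"   # new URLs land in the crawldb here

# Workaround 2: a throwaway crawldb per seed batch, merged back afterwards.
# generate only sees the new URLs (the fresh crawldb contains nothing else),
# but mergedb adds overhead and refetching can still occur for seeds that
# already exist in the main crawldb.
bin/nutch inject crawl/crawldb_new new_seeds/
bin/nutch generate crawl/crawldb_new crawl/segments
SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
bin/nutch updatedb crawl/crawldb_new "$SEGMENT"
bin/nutch mergedb crawl/crawldb_merged crawl/crawldb crawl/crawldb_new
```

In both cases the segment directory name is timestamped by Nutch, hence the `ls -t | head -1` to pick up the one just generated.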
