I have a Nutch crawldb into which new seed URLs are injected quite regularly. Once new URLs are injected, I would like to do a crawl run that fetches these newly injected URLs only (i.e. URLs that were already present in the crawldb before these new URLs were injected should not be fetched). Is there any way to accomplish this?
One possibility I can think of is to use the FreeGenerator to generate a fetchlist directly from the new seed URLs instead of ever injecting them into the crawldb; the updatedb step following the crawl would put these URLs into the crawldb anyway. But this would cause URLs that have already been fetched earlier (as recorded in the crawldb) to be refetched. Another option could be to use a new crawldb for each batch of new seed URLs, do the fetch/updatedb against it, and then merge this new crawldb with the main crawldb. Apart from the additional overhead of merging the crawldbs, this approach, like the one above, would also cause already-fetched URLs to be refetched. Is there a better approach? Thanks, Siddhartha
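For reference, the two workarounds above would look roughly like the following with the Nutch 1.x command line. This is only a sketch: the directory names (new_seeds/, crawl/crawldb, etc.) are placeholders, and I haven't verified these exact invocations against my setup.

```shell
#!/bin/sh
# Workaround 1: bypass inject and build a fetchlist straight from the seed file.
# freegen does not consult the crawldb, so any seed URL that was fetched
# before would be fetched again -- this is the drawback described above.
bin/nutch freegen new_seeds/ crawl/segments
SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
bin/nutch updatedb crawl/crawldb "$SEGMENT"   # new URLs land in the crawldb here

# Workaround 2: a throwaway crawldb per seed batch, merged back afterwards.
# generate only sees the new URLs (the fresh crawldb contains nothing else),
# but mergedb adds overhead and refetching can still occur for seeds that
# already exist in the main crawldb.
bin/nutch inject crawl/crawldb_new new_seeds/
bin/nutch generate crawl/crawldb_new crawl/segments
SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
bin/nutch updatedb crawl/crawldb_new "$SEGMENT"
bin/nutch mergedb crawl/crawldb_merged crawl/crawldb crawl/crawldb_new
```

In both cases the segment directory name is timestamped by Nutch, hence the `ls -t | head -1` to pick up the one just generated.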
