If all you want is to crawl your own URLs, you can do the following: (1) inject all URLs; (2) keep generating segments and fetching them without updating the crawldb; (3) once all fetching is done, run updatedb, index, and you are done. Because updatedb never runs between fetch rounds, the outlinks discovered during fetching are never added to the crawldb, so only the injected URLs can ever be generated.
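A rough sketch of that cycle with the standard Nutch 1.x command-line tools (directory names like crawl/crawldb, the urls seed directory, and the -topN value are placeholders; adjust them to your layout):

  # (1) seed the crawldb with your URLs only
  bin/nutch inject crawl/crawldb urls

  # (2) generate a fetchlist and fetch it; repeat for more batches,
  #     but do NOT run updatedb between rounds
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment

  # (3) only after all fetching is done: update, invert links, index
  bin/nutch updatedb crawl/crawldb -dir crawl/segments
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

If you run several generate/fetch rounds, setting generate.update.crawldb to true should keep generate from selecting the same URLs twice even without an intervening updatedb.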
On Thu, 2009-06-25 at 07:27 -0700, caezar wrote:
> Hi All,
>
> Here is the problem: I need Nutch to crawl ONLY the URLs I've injected.
> Currently, by setting db.ignore.external.links to true I've made Nutch not
> automatically crawl URLs found as external links on crawled pages. But it
> is still crawling URLs found as internal links (it seems that
> db.ignore.internal.links does not affect this). I don't want to create URL
> filters, because there are millions of URLs and it is not possible to
> write regexps for them. So is there a way to achieve this?
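For reference, the setting mentioned in the quoted mail goes in conf/nutch-site.xml, inside the usual <configuration> element:

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>

As far as I know, db.ignore.internal.links only limits which links get stored in the linkdb; it does not keep same-host outlinks out of the crawldb, which is why it does not help here.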
