If all you want is to crawl your own URLs, you can do the following: (1) inject all URLs; (2) keep generating segments and fetching them without updating the crawldb; (3) once all fetching is done, run updatedb, index, and you are done. Because updatedb never runs between fetch rounds, the outlinks discovered during fetching are never added to the crawldb, so only the injected URLs can ever be generated.
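A rough sketch of that cycle with the standard Nutch 1.x command-line tools (directory names like crawl/crawldb, the urls seed directory, and the -topN value are placeholders; adjust them to your layout):

  # (1) seed the crawldb with your URLs only
  bin/nutch inject crawl/crawldb urls

  # (2) generate a fetchlist and fetch it; repeat for more batches,
  #     but do NOT run updatedb between rounds
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment

  # (3) only after all fetching is done: update, invert links, index
  bin/nutch updatedb crawl/crawldb -dir crawl/segments
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

If you run several generate/fetch rounds, setting generate.update.crawldb to true should keep generate from selecting the same URLs twice even without an intervening updatedb.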
On Thu, 2009-06-25 at 07:27 -0700, caezar wrote:
> Hi All,
>
> Here is the problem: I need Nutch to crawl ONLY the URLs I've injected.
> Currently, by setting db.ignore.external.links to true I've made Nutch not
> automatically crawl URLs found as external links on crawled pages. But it
> is still crawling URLs found as internal links (it seems that
> db.ignore.internal.links does not affect this). I don't want to create URL
> filters, because there are millions of URLs and it is not possible to
> write regexps for them. So is there a way to achieve this?
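For reference, the setting mentioned in the quoted mail goes in conf/nutch-site.xml, inside the usual <configuration> element:

  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>

As far as I know, db.ignore.internal.links only limits which links get stored in the linkdb; it does not keep same-host outlinks out of the crawldb, which is why it does not help here.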
