Hi,

Just to add on top of what Gabriele and Luis said, you may want to look at the "db.ignore.external.links" property. If you have a large seed list, crawl-urlfilter and regex-urlfilter can become quite a pain to maintain and may have a performance impact (am I right?). If you don't want to add any new links at all, even from the same host, then you should take a look at "db.update.additions.allowed".
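For what it's worth, a minimal sketch of how those two properties could be overridden in conf/nutch-site.xml (the property names come from the Nutch default configuration; the values shown are just examples for a "crawl only my seeds" setup):

```xml
<!-- Illustrative overrides for conf/nutch-site.xml -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks pointing to a different host are
  ignored, so the crawl stays within the hosts of the seed list
  without needing URL-filter rules per host.</description>
</property>
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If false, the updatedb step will not add newly
  discovered URLs to the crawl db; only the injected seed URLs
  will ever be fetched.</description>
</property>
```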

-----Original Message----- From: Luis Cappa Banda
Sent: Saturday, May 14, 2011 8:18 AM
To: [email protected]
Subject: Re: how to force nutch to crawl specific urls?

Hello.

As Gabriele said before, you should specify the URL list that you'll use
for crawling/fetching. You can enter a specific URL, or a domain URL
pointing to, for example, a particular HTML or PDF file. Of course, you can
put several URLs in a list, not only one. In any case, be careful with the
crawl-urlfilter.txt config file: if you have configured it before, maybe
the pattern that you chose won't be applicable to your new URL list, and
then you won't index anything. It's a very common mistake.
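To illustrate the kind of mistake Luis describes, here is a sketch of what a crawl-urlfilter.txt / regex-urlfilter.txt rule set typically looks like (the host is a placeholder; rules are tried top to bottom, and the first matching +/- rule wins):

```
# Skip URLs with these suffixes (images, archives, etc.)
-\.(gif|jpg|png|zip|gz)$

# Accept only URLs under the seed host (placeholder domain).
# If your new seed URLs are on a different host, they will fall
# through to the final rule below and be rejected.
+^http://www\.example\.com/

# Reject everything else
-.
```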
