Hi,

Just to add on top of what Gabriele and Luis said, you may want to look at the "db.ignore.external.links" property. If you have a large seed list, crawl-urlfilter and regex-urlfilter can become quite a pain to maintain and may have a performance impact (am I right?). If you don't want to add any new links at all, even from the same host, then you should take a look at "db.update.additions.allowed".
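For what it's worth, a minimal sketch of how those two properties could be overridden in conf/nutch-site.xml (the property names come from the Nutch default configuration; the values shown are just examples for a "crawl only my seeds" setup):

```xml
<!-- Illustrative overrides for conf/nutch-site.xml -->
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks pointing to a different host are
  ignored, so the crawl stays within the hosts of the seed list
  without needing URL-filter rules per host.</description>
</property>
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If false, the updatedb step will not add newly
  discovered URLs to the crawl db; only the injected seed URLs
  will ever be fetched.</description>
</property>
```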

-----Original Message----- From: Luis Cappa Banda
Sent: Saturday, May 14, 2011 8:18 AM
To: [email protected]
Subject: Re: how to force nutch to crawl specific urls?

Hello.

As Gabriele said before, you should specify the URL list that you'll use
for crawling/fetching. You can enter a specific URL, or a domain URL
pointing to, for example, a particular HTML or PDF file. Of course, you can
put several URLs in a list, not only one. In any case, be careful with the
crawl-urlfilter.txt config file: if you have configured it before, maybe
the pattern that you chose won't be applicable to your new URL list, and
then you won't index anything. It's a very common mistake.
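To illustrate the kind of mistake Luis describes, here is a sketch of what a crawl-urlfilter.txt / regex-urlfilter.txt rule set typically looks like (the host is a placeholder; rules are tried top to bottom, and the first matching +/- rule wins):

```
# Skip URLs with these suffixes (images, archives, etc.)
-\.(gif|jpg|png|zip|gz)$

# Accept only URLs under the seed host (placeholder domain).
# If your new seed URLs are on a different host, they will fall
# through to the final rule below and be rejected.
+^http://www\.example\.com/

# Reject everything else
-.
```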
