Re: Limit Nutch Crawl to Seed URLs

Stevan Kovacevic Fri, 13 Mar 2009 06:19:42 -0700

Hi,
you can avoid going to other domains by editing the urlfilter file,
but this is not very practical when you have a lot of seed URLs, which
you do. The nutch-default.xml file has a property,
db.ignore.external.links, which is set to false by default. Set it to
true and you will only crawl the domains of your seed URLs. The file is
located in the conf folder, in case you don't know. Note that if, while
crawling, you hit a link that redirects to another domain, Nutch will
treat the domain you are redirected to as valid.
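
For illustration, the setting might look like the fragment below. This
is only a sketch: it assumes you put the override in conf/nutch-site.xml
(which takes precedence over nutch-default.xml) rather than editing the
defaults directly, and the surrounding <configuration> element may
already exist in your copy of the file.

```xml
<configuration>
  <!-- Assumed override: ignore outlinks that lead to external hosts,
       so the crawl stays on the hosts of the seed URLs. -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
```

Changing the value in nutch-default.xml itself, as described above,
should have the same effect; the override file just keeps your change
separate from the shipped defaults.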


On Fri, Mar 13, 2009 at 10:59 AM, MyD <myd.ro...@googlemail.com> wrote:
>
> Hi @ all,
>
> is it possible to limit Nutch's crawling process to the seed URLs? E.g. I
> have 1000 seed URLs and I want to crawl just these domains. Thanks in
> advance.
>
> Regards,
> MyD
