Nutch crawl configuration

Arian Azin Sun, 11 Aug 2013 00:13:11 -0700

Hi Everyone,

I'm using Nutch 1.7 to crawl the contents of a number of sites. I want it to
fetch 10 pages from each seed's own site, without following outlinks to other
sites. Say I want to crawl www.example1.com, and some pages there have
outlinks to www.example2.com. Here I provide example1.com as a seed and want
10 pages (exactly 10, unless fewer exist) from example1.com only. I have 100+
sites to crawl, so I can't write a regex for every single URL.
Also, I want pictures and videos to be excluded from the crawl results.
Could anyone please tell me what I should set? I have read the
documentation a couple of times without finding the answer.


Thanks,
Arian
