Re: Nutch crawl configuration

kaveh minooie Mon, 12 Aug 2013 13:37:22 -0700

to the best of my understanding, you can't really do that.

you can use regex-urlfilter.txt and/or suffix-urlfilter to exclude the items that you don't want to crawl, so that should take care of the picture and video issue.
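for example, a rough sketch of what the relevant part of conf/regex-urlfilter.txt could look like. this assumes the regex URL filter plugin is enabled, and the suffix list is only my guess at what counts as pictures and videos for your sites, so adjust it. rules are tried top to bottom and the first match decides; a leading "-" rejects the URL and a leading "+" accepts it:

```
# reject URLs ending in common image/video suffixes
# (illustrative list, not exhaustive; (?i) makes the match case-insensitive)
-(?i)\.(gif|jpg|jpeg|png|bmp|ico|mpg|mpeg|mp4|avi|mov|wmv|flv)$

# accept everything else
+.
```

note that this only filters on the URL text, so a picture or video served from a URL without one of these suffixes would still get through.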

but you can't really limit the number of pages that would be fetched per site. you can do that per fetch job, though. so if it is only about 100 sites, you could run the fetch 100 times and only get 10 pages each time?




On 08/11/2013 12:12 AM, Arian Azin wrote:

Hi Everyone,

I'm using Nutch 1.7 to crawl the contents of a number of sites. I want it to
get 10 pages from each seed, not including pages from outlinks of the seed. Say
I want to crawl www.example1.com, and some pages there have outlinks to
www.example2.com. Here I provide example1.com as a seed, and want 10 pages
(exactly 10, unless there aren't that many) only from example1.com (I've got
100+ sites to crawl, so I can't set regexes matching every single URL).
Also, I want pictures and videos to be excluded from the crawl results.
Could anyone please help me with what I should set? I read the
documentation a couple of times with no results.

Thanks,
Arian


--
Kaveh Minooie
