To the best of my understanding, you can't really do that.

You can use regex-urlfilter.txt and/or suffix-urlfilter.txt to exclude the items that you don't want to crawl, so that should take care of the picture and video issue.
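
As a rough illustration of the exclusion rules mentioned above, here is a small Python sketch that mimics how I understand the regex filter file to behave: each rule starts with '-' (exclude) or '+' (include), the first rule that matches a URL decides, and a URL matching no rule is dropped. The extension lists, the (?i) case-insensitive flag, and the accepts() helper are my own assumptions for illustration, not the rules shipped with Nutch, so check them against your own conf directory before relying on them.

```python
import re

# Rules written in the style of a regex-urlfilter.txt file.
# The extension lists below are illustrative assumptions.
RULES = [
    r"-(?i)\.(gif|jpg|jpeg|png|bmp|ico|svg)$",  # pictures
    r"-(?i)\.(mp4|avi|mov|mpg|mpeg|wmv|flv)$",  # videos
    r"+.",                                      # accept everything else
]

def accepts(url, rules=RULES):
    """Return True if the first rule matching the URL is an include rule."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the URL is dropped

if __name__ == "__main__":
    for url in ("http://www.example1.com/index.html",
                "http://www.example1.com/logo.PNG",
                "http://www.example1.com/clip.mp4"):
        print(url, accepts(url))
```

The three rule strings (without the Python quoting) are the sort of lines that would go into the filter file itself. Note that a URL with a query string after the extension would not be caught by the "$" anchor.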

But you can't really limit the number of pages that would be fetched per site; you can only do that per fetch job. So if it is only about 100 sites, you could run the fetch 100 times, once per site, and get only 10 pages each time?



On 08/11/2013 12:12 AM, Arian Azin wrote:
Hi Everyone,

I'm using Nutch 1.7 to crawl the contents of a number of sites. I want it to
get 10 pages from each seed, not including pages from outlinks of the seed. Say
I want to crawl www.example1.com, and some pages there have outlinks to
www.example2.com. Here I provide example1.com as a seed, and want 10
pages (exactly 10, unless there aren't that many) only from example1.com (I
have 100+ sites to crawl, so I can't set regexes matching every single URL).
Also, I want pictures and videos to be excluded from crawl results.
Could anyone please help me with what I should set? I read the
documentation a couple of times with no results.

Thanks,
Arian


--
Kaveh Minooie
