To the best of my understanding, you can't really do that.
You can use regex-urlfilter.txt and/or suffix-urlfilter.txt to exclude the items that you don't want to crawl, so that should take care of the picture and video issue.
But you can't really limit the number of pages that would be fetched per site. You can limit it per fetch job, though. So if it is only about 100 sites, you could run the fetch 100 times and get only 10 pages each time?
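To make the URL-filter suggestion above a bit more concrete, here is a rough Python sketch that mimics how I understand regex-urlfilter.txt rules to be evaluated: rules are tried top to bottom, the first pattern found in the URL decides, '-' rejects, '+' accepts, and a URL matching nothing is dropped. The extension list and the helper names (parse_rules, accepts) are my own illustration, not Nutch's shipped defaults or API, so treat it as a way to try out patterns before putting them in the real file.

```python
import re

# Rules written in the style of regex-urlfilter.txt.
# The suffix list below is an illustrative assumption, not Nutch's default.
RULES_TEXT = r"""
# skip common image and video suffixes (case-insensitive)
-(?i)\.(gif|jpg|jpeg|png|bmp|ico|svg|mp4|avi|mov|mpg|mpeg|wmv|flv|webm)$
# accept anything else
+.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in the URL."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: the URL is dropped


RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    for url in (
        "http://www.example1.com/index.html",
        "http://www.example1.com/img/logo.PNG",
        "http://www.example1.com/media/clip.mp4",
    ):
        print(url, "->", "keep" if accepts(RULES, url) else "skip")
```

Note that the pattern is anchored at the end of the URL, so something like photo.jpg?size=large would still get through; a looser pattern would be needed for that case.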
On 08/11/2013 12:12 AM, Arian Azin wrote:
Hi Everyone, I'm using Nutch 1.7 to crawl the contents of a number of sites. I want it to get 10 pages from each seed, not including pages from outlinks of the seed. Say I want to crawl www.example1.com, and some pages there have outlinks to www.example2.com. Here I provide example1.com as a seed, and want 10 pages (exactly 10, unless there aren't that many) only from example1.com (I have 100+ sites to crawl, so I can't set regexes matching every single URL). Also, I want pictures and videos to be excluded from the crawl results. Could anyone please help me with what I should set? I read the documentation a couple of times with no results. Thanks, Arian
-- Kaveh Minooie

