Hello - this is possible using the HostDB. If you run updatehostdb frequently you 
get statistics on the number of fetched, redirected, 404 and unfetched records for 
any given host. Using readhostdb and a Jexl expression, you can then emit a 
blacklist of hosts to use during the generate step.

# Update the hostdb
bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb/

# Get list of hosts that have 100 or more records fetched or not modified
bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(ok >= 100)'

# Or get list of hosts that have 100 or more records in total
bin/nutch readhostdb crawl/hostdb/ output -dumpHostnames -expr '(numRecords >= 100)'
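
The -dumpHostnames output should be plain text hostnames, so you can feed it 
straight into the blacklist file. A rough sketch, assuming a local (non-HDFS) run, 
that the job wrote part files under output/, and that the urlfilter-domainblacklist 
plugin reads conf/domainblacklist-urlfilter.txt (the default for 
urlfilter.domainblacklist.file in my setup):

# Collect the matching hostnames into the domain blacklist file
# (adjust the part file pattern and conf path to your installation)
cat output/part-* > $NUTCH_HOME/conf/domainblacklist-urlfilter.txt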

The fields you can use in an expression are listed in ReadHostDb.java (lines 93-104):
http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/hostdb/ReadHostDb.java?view=markup

You now have a list of hostnames that you can use with the 
domainblacklist-urlfilter at the generate step.
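
For the blacklist to actually take effect when generating, the plugin has to be 
enabled; a sketch, assuming the plugin id is urlfilter-domainblacklist and that 
URL filters run at generate time unless -noFilter is passed:

# Add urlfilter-domainblacklist to plugin.includes in conf/nutch-site.xml,
# then generate as usual; urls on blacklisted hosts are filtered out
bin/nutch generate crawl/crawldb crawl/segments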

Markus

 
-----Original message-----
> From:Tomasz <polish.software.develo...@gmail.com>
> Sent: Wednesday 24th February 2016 11:30
> To: user@nutch.apache.org
> Subject: Limit number of pages per host/domain
> 
> Hello,
> 
> One can set generate.max.count to limit the number of urls per domain or host
> in the next fetch step. But is there a way to limit the number of fetched urls
> per domain/host in the whole crawl process? Suppose I run the generate/fetch/update
> cycle 6 times and want to limit the number of urls per host to 100 urls (pages)
> and no more in the whole crawldb. How can I achieve that?
> 
> Regards,
> Tomasz
> 
