You are absolutely right. One way to limit the crawl by depth is to write a custom ScoringFilter that tracks each URL's depth from the seed and either prevents its outlinks from being added or prevents the URL from being generated.
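
To make the idea concrete, here is a rough standalone sketch of the depth bookkeeping such a filter would do. It is plain Python and does not use the Nutch ScoringFilter API: the link graph, the function name and the max_depth parameter are all made up for illustration. In Nutch the depth would presumably travel in the CrawlDatum metadata rather than in a dict.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for fetched pages and
# their outlinks; in a real crawl this comes from the parse step.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/a/1"],
    "http://example.com/a/1": ["http://example.com/a/1/x"],
    "http://example.com/b": [],
}


def crawl_with_depth_limit(seeds, links, max_depth):
    """Return {url: depth} for every URL within max_depth links of a seed."""
    depth = {seed: 0 for seed in seeds}  # seeds are injected at depth 0
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        child_depth = depth[url] + 1
        if child_depth > max_depth:
            # The page is already at the limit: drop its outlinks.
            continue
        for outlink in links.get(url, []):
            if outlink not in depth:  # first visit is the shortest path
                depth[outlink] = child_depth
                queue.append(outlink)
    return depth


if __name__ == "__main__":
    for url, d in crawl_with_depth_limit(["http://example.com/"], LINKS, 2).items():
        print(d, url)
```

With max_depth set to 2, http://example.com/a/1/x is never added because its parent is already at the limit. Unlike a per-run cap, the limit holds no matter how many generate/fetch rounds you run.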
Interestingly, someone opened a JIRA issue on this, NUTCH-1331 <https://issues.apache.org/jira/browse/NUTCH-1331>, which contains a patch. However, I think there could be a less intrusive approach.

HTH

Julien

On 11 April 2012 16:21, Anders Rask <anr...@gmail.com> wrote:

> As I understand it, those properties will only limit the number of URLs
> that are crawled per site for each time you run generate.
>
> But since Nutch works in such a way that you need to do an infinite loop of
> generate/fetch in order to recrawl sites then the total number of URLs that
> are crawled for one site will not be limited by the generate.max.count
> parameter. Am I right?
>
>
> Best regards,
> --Anders Rask
> www.findwise.com
>
> On 11 April 2012 17:14, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
> > Check these properties:
> >
> > <property>
> >   <name>generate.max.count</name>
> >   <value>-1</value>
> >   <description>The maximum number of urls in a single
> >   fetchlist. -1 if unlimited. The urls are counted according
> >   to the value of the parameter generator.count.mode.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>generate.count.mode</name>
> >   <value>host</value>
> >   <description>Determines how the URLs are counted for generator.max.count.
> >   Default value is 'host' but can be 'domain'. Note that we do not count
> >   per IP in the new version of the Generator.
> >   </description>
> > </property>
> >
> >
> > On Wednesday 11 April 2012 17:05:04 Anders Rask wrote:
> > > Hi!
> > >
> > > I would like to be able to limit how many pages Nutch crawls from a
> > > specific site, either by specifying the total number of pages to crawl from
> > > one site or by specifying a depth of how many links that should be followed
> > > from the initial seed.
> > >
> > > I've been working with Nutch for some time now but haven't been able to
> > > figure out how this can be achieved. So my question is: Is there any way to
> > > configure Nutch for this, and if not are there any plans to implement this
> > > functionality?
> > >
> > >
> > > Best regards,
> > > --Anders Rask
> > > www.findwise.com
> >
> > --
> > Markus Jelsma - CTO - Openindex
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble