As I understand it, those properties only limit the number of URLs
selected per site each time you run generate.

But since recrawling sites in Nutch means running generate/fetch in a
continuous loop, the total number of URLs crawled for one site over time
will not be limited by the generate.max.count parameter. Am I right?
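
To illustrate what I mean, here is a toy model in plain Python. It is not
Nutch code: the function name, the host name and the numbers are made up,
and it ignores newly discovered links, scoring and refetch intervals. It
only assumes that each generate round selects at most max_count URLs per
host, with -1 meaning unlimited.

```python
from collections import Counter


def simulate_crawl(pending_per_host, max_count, rounds):
    """Toy model of repeated generate/fetch rounds (hypothetical helper).

    pending_per_host: dict mapping host -> number of URLs waiting to be fetched
    max_count: per-host cap applied in each round, -1 for unlimited
    rounds: how many generate/fetch cycles to run
    Returns a Counter with the total number of URLs fetched per host.
    """
    fetched = Counter()
    pending = dict(pending_per_host)  # copy so the caller's dict is untouched
    for _ in range(rounds):
        for host in list(pending):
            remaining = pending[host]
            # The cap applies to this round only, not to the running total.
            batch = remaining if max_count < 0 else min(remaining, max_count)
            fetched[host] += batch
            pending[host] = remaining - batch
    return fetched


totals = simulate_crawl({"example.com": 1000}, max_count=100, rounds=5)
print(totals["example.com"])  # 500: five rounds of 100 URLs each
```

So in this model a cap of 100 per round still gives 500 pages from the
same host after five rounds, and the total keeps growing with every
further round.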


Best regards,
--Anders Rask
www.findwise.com

On 11 April 2012 17:14, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Check these properties:
>
> <property>
>   <name>generate.max.count</name>
>   <value>-1</value>
>   <description>The maximum number of urls in a single
>   fetchlist. -1 if unlimited. The urls are counted according
>   to the value of the parameter generator.count.mode.
>   </description>
> </property>
>
> <property>
>   <name>generate.count.mode</name>
>   <value>host</value>
>   <description>Determines how the URLs are counted for generator.max.count.
>   Default value is 'host' but can be 'domain'. Note that we do not count
>   per IP in the new version of the Generator.
>   </description>
> </property>
>
>
>
> On Wednesday 11 April 2012 17:05:04 Anders Rask wrote:
> > Hi!
> >
> > I would like to be able to limit how many pages Nutch crawls from a
> > specific site, either by specifying the total number of pages to crawl
> > from one site or by specifying a depth of how many links should be
> > followed from the initial seed.
> >
> > I've been working with Nutch for some time now but haven't been able to
> > figure out how this can be achieved. So my question is: Is there any way
> > to configure Nutch for this, and if not, are there any plans to implement
> > this functionality?
> >
> >
> > Best regards,
> > --Anders Rask
> > www.findwise.com
>
> --
> Markus Jelsma - CTO - Openindex
>
