You are absolutely right. One way to limit by depth is to write a custom
ScoringFilter that tracks each URL's depth from the seed and either keeps
outlinks beyond the limit from being added or keeps those URLs from being
generated.
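
To make the idea concrete, here is a rough, Nutch-independent sketch of the
depth bookkeeping. Everything in it (the DepthLimiter class, its method names,
the in-memory map) is hypothetical and only models the logic; in a real plugin
the depth would presumably travel in the CrawlDatum metadata and the checks
would sit inside the ScoringFilter, which I have not written or tested.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical model of depth tracking from the seed.
 * None of this is Nutch API; it only illustrates the bookkeeping.
 */
public class DepthLimiter {
    private final int maxDepth;
    // url -> smallest depth at which the url has been seen (seeds are depth 0)
    private final Map<String, Integer> depthByUrl = new HashMap<>();

    public DepthLimiter(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    /** Seeds start at depth 0. */
    public void inject(String seedUrl) {
        depthByUrl.put(seedUrl, 0);
    }

    /** Returns the outlinks that may be added; empty if they would exceed maxDepth. */
    public List<String> acceptOutlinks(String fromUrl, List<String> outlinks) {
        List<String> accepted = new ArrayList<>();
        Integer parentDepth = depthByUrl.get(fromUrl);
        if (parentDepth == null) {
            return accepted; // unknown parent: nothing to propagate
        }
        int childDepth = parentDepth + 1;
        if (childDepth > maxDepth) {
            return accepted; // too deep: drop all outlinks of this page
        }
        for (String link : outlinks) {
            // keep the shortest known path from a seed
            depthByUrl.merge(link, childDepth, Integer::min);
            accepted.add(link);
        }
        return accepted;
    }

    /** A url is eligible for generation only if it is known and within the limit. */
    public boolean shouldGenerate(String url) {
        Integer depth = depthByUrl.get(url);
        return depth != null && depth <= maxDepth;
    }
}
```

With maxDepth set to 1, the seed's outlinks are accepted, but the outlinks of
those pages are dropped, so the crawl of a site stops growing no matter how
many generate/fetch rounds are run.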

Interestingly, someone opened a JIRA issue on this, NUTCH-1331
<https://issues.apache.org/jira/browse/NUTCH-1331>, which contains a
patch; however, I think there could be a less intrusive approach.

HTH

Julien

On 11 April 2012 16:21, Anders Rask <anr...@gmail.com> wrote:

> As I understand it, those properties only limit the number of URLs
> crawled per site in each generate run.
>
> But since Nutch requires running generate/fetch in a loop in order to
> recrawl sites, the total number of URLs crawled for one site is not
> limited by the generate.max.count parameter. Am I right?
>
>
> Best regards,
> --Anders Rask
> www.findwise.com
>
> On 11 April 2012 17:14, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
> > Check these properties:
> >
> > <property>
> >   <name>generate.max.count</name>
> >   <value>-1</value>
> >   <description>The maximum number of urls in a single
> >   fetchlist. -1 if unlimited. The urls are counted according
> >   to the value of the parameter generate.count.mode.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>generate.count.mode</name>
> >   <value>host</value>
> >   <description>Determines how the URLs are counted for
> >   generate.max.count. Default value is 'host' but can be 'domain'.
> >   Note that we do not count per IP in the new version of the Generator.
> >   </description>
> > </property>
> >
> >
> >
> > On Wednesday 11 April 2012 17:05:04 Anders Rask wrote:
> > > Hi!
> > >
> > > I would like to be able to limit how many pages Nutch crawls from a
> > > specific site, either by specifying the total number of pages to crawl
> > > from one site or by specifying a depth of how many links should be
> > > followed from the initial seed.
> > >
> > > I've been working with Nutch for some time now but haven't been able to
> > > figure out how this can be achieved. So my question is: Is there any way
> > > to configure Nutch for this, and if not, are there any plans to implement
> > > this functionality?
> > >
> > >
> > > Best regards,
> > > --Anders Rask
> > > www.findwise.com
> >
> > --
> > Markus Jelsma - CTO - Openindex
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
