Thank you for your answers! I'll put some thought into this and then we
might implement it if we can find time for it.
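
As a starting point, the approach Markus describes below (periodically
compute per-host statistics and blacklist hosts that pass a threshold)
could be sketched roughly as follows. This is only an illustration and not
Nutch code: the function names, the plain list of URLs as input, and the
one-host-per-line output are all assumptions made for the example. How the
URLs are exported from the crawl and how the filter reads its list would
have to be checked against the actual Nutch setup.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_pages_per_host(urls):
    """Count how many URLs belong to each host (hypothetical helper)."""
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if host:  # skip strings that have no parseable host
            counts[host] += 1
    return counts


def hosts_over_limit(counts, max_pages):
    """Return the sorted hosts whose page count is at or above max_pages."""
    return sorted(host for host, n in counts.items() if n >= max_pages)


def format_blacklist(hosts):
    """Render the hosts one per line (assumed format for the filter file)."""
    return "".join(host + "\n" for host in hosts)
```

Run between generate/fetch cycles, something like this would stop new URLs
from a host being generated once the host has reached its page budget.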


Best regards,
--Anders Rask
www.findwise.com

On 11 April 2012 at 17:51, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Ah, I see. Well, this is not possible right now, and making it work may
> not be very easy, as Nutch doesn't store the state of a domain or host.
>
> What you can do is periodically compute statistics per host or domain and
> add hosts or domains to the DomainBlackListFilter if they exceed your
> threshold. You must then use that filter together with the generator. It's
> some work, but it will fix your issue.
>
> Keep in mind that the current domain statistics tool only aggregates
> statistics for fetched and not-modified pages per host or domain, but you
> might want to include redirects as well.
>
>
> On Wed, 11 Apr 2012 17:21:47 +0200, Anders Rask <anr...@gmail.com> wrote:
>
>> As I understand it, those properties will only limit the number of URLs
>> that are crawled per site for each time you run generate.
>>
>> But since Nutch works in such a way that you need to do an infinite loop
>> of generate/fetch in order to recrawl sites then the total number of URLs
>> that are crawled for one site will not be limited by the
>> generate.max.count parameter. Am I right?
>>
>>
>> Best regards,
>> --Anders Rask
>> www.findwise.com
>>
>> On 11 April 2012 at 17:14, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>
>>> Check these properties:
>>>
>>> <property>
>>>   <name>generate.max.count</name>
>>>   <value>-1</value>
>>>   <description>The maximum number of urls in a single
>>>   fetchlist. -1 if unlimited. The urls are counted according
>>>   to the value of the parameter generator.count.mode.
>>>   </description>
>>> </property>
>>>
>>> <property>
>>>   <name>generate.count.mode</name>
>>>   <value>host</value>
>>>   <description>Determines how the URLs are counted for
>>>   generator.max.count. Default value is 'host' but can be
>>>   'domain'. Note that we do not count per IP in the new
>>>   version of the Generator.
>>>   </description>
>>> </property>
>>>
>>>
>>>
>>> On Wednesday 11 April 2012 17:05:04 Anders Rask wrote:
>>> > Hi!
>>> >
>>> > I would like to be able to limit how many pages Nutch crawls from a
>>> > specific site, either by specifying the total number of pages to crawl
>>> > from one site or by specifying a depth of how many links that should be
>>> > followed from the initial seed.
>>> >
>>> > I've been working with Nutch for some time now but haven't been able to
>>> > figure out how this can be achieved. So my question is: Is there any
>>> > way to configure Nutch for this, and if not are there any plans to
>>> > implement this functionality?
>>> >
>>> >
>>> > Best regards,
>>> > --Anders Rask
>>> > www.findwise.com
>>>
>>> --
>>> Markus Jelsma - CTO - Openindex
>>>
>>>
>
