Thank you for your answers! I'll put some thought into this and then we might implement it if we can find time for it.

Best regards,
--Anders Rask
www.findwise.com

On 11 April 2012 17:51, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Ah, I see. Well, this is not possible right now, and making this work may
> not be very easy, as Nutch doesn't store the state of a domain or host.
>
> What you can do is periodically compute statistics per host or domain and
> add hosts or domains to the DomainBlackListFilter if they exceed your
> threshold. You must then use that filter together with the generator. It's
> some work but it will fix your issue.
>
> Keep in mind that the current domain statistics tool only aggregates
> statistics for fetched and not-modified pages per host or domain, but you
> might want to include redirects as well.
>
>
> On Wed, 11 Apr 2012 17:21:47 +0200, Anders Rask <anr...@gmail.com> wrote:
>
>> As I understand it, those properties will only limit the number of URLs
>> that are crawled per site each time you run generate.
>>
>> But since Nutch works in such a way that you need to run an infinite loop
>> of generate/fetch in order to recrawl sites, the total number of URLs that
>> are crawled for one site will not be limited by the generate.max.count
>> parameter. Am I right?
>>
>>
>> Best regards,
>> --Anders Rask
>> www.findwise.com
>>
>> On 11 April 2012 17:14, Markus Jelsma <markus.jel...@openindex.io> wrote:
>>
>> Check these properties:
>>>
>>> <property>
>>>   <name>generate.max.count</name>
>>>   <value>-1</value>
>>>   <description>The maximum number of urls in a single
>>>   fetchlist. -1 if unlimited. The urls are counted according
>>>   to the value of the parameter generator.count.mode.
>>>   </description>
>>> </property>
>>>
>>> <property>
>>>   <name>generate.count.mode</name>
>>>   <value>host</value>
>>>   <description>Determines how the URLs are counted for generator.max.count.
>>>   Default value is 'host' but can be 'domain'. Note that we do not count
>>>   per IP in the new version of the Generator.
>>>   </description>
>>> </property>
>>>
>>>
>>> On Wednesday 11 April 2012 17:05:04 Anders Rask wrote:
>>> > Hi!
>>> >
>>> > I would like to be able to limit how many pages Nutch crawls from a
>>> > specific site, either by specifying the total number of pages to crawl
>>> > from one site or by specifying a depth of how many links should be
>>> > followed from the initial seed.
>>> >
>>> > I've been working with Nutch for some time now but haven't been able to
>>> > figure out how this can be achieved. So my question is: Is there any way
>>> > to configure Nutch for this, and if not, are there any plans to implement
>>> > this functionality?
>>> >
>>> >
>>> > Best regards,
>>> > --Anders Rask
>>> > www.findwise.com
>>>
>>> --
>>> Markus Jelsma - CTO - Openindex
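
The periodic check suggested in this thread (count pages per host, then blacklist the hosts that exceed a threshold) could be sketched roughly as follows. This is only an illustration in Python, not Nutch code: the function name `hosts_over_limit`, the sample URLs and the threshold are made up, and it assumes the fetched URLs have already been dumped from the crawl database into a plain list by some other means.

```python
from collections import Counter
from urllib.parse import urlsplit


def hosts_over_limit(urls, max_pages):
    """Return, sorted, the hosts that appear in more than max_pages URLs.

    `urls` is assumed to be an iterable of already-fetched URLs
    (hypothetical input; obtaining it from Nutch is not shown here).
    """
    counts = Counter()
    for url in urls:
        host = urlsplit(url).hostname  # lower-cased host, or None if absent
        if host:
            counts[host] += 1
    return sorted(host for host, n in counts.items() if n > max_pages)


if __name__ == "__main__":
    # Made-up sample data: three pages from one host, one from another.
    fetched = [
        "http://a.example/1",
        "http://a.example/2",
        "http://a.example/3",
        "http://b.example/1",
    ]
    # Hosts printed here are the candidates to add to the blacklist.
    for host in hosts_over_limit(fetched, max_pages=2):
        print(host)
```

The hosts it prints would then have to be added to whatever list the blacklist filter reads, and the check rerun between generate/fetch rounds. Counting per domain instead of per host, or including redirects as mentioned above, would need a different grouping key and input.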