On Wed, Aug 26, 2009 at 9:43 AM, Paul Tomblin<[email protected]> wrote:
> On Wed, Aug 26, 2009 at 10:34 AM, Ken
> Krugler<[email protected]> wrote:
>> If the sites you are crawling are under your control, or you have an
>> understanding with the site ops people, then you can alter Nutch's default
>> settings to make it run at near full speed.
>
> What settings would those be? I tried increasing the number of
> threads from 10 to 125, but it had absolutely no discernible effect on
> the crawl speed.
>
Paul,
I'd read the conf/nutch-default.xml file. I believe the properties
you'd like to examine are in the section labelled
<!-- fetcher properties -->:

  fetcher.threads.per.host
  fetcher.server.delay
  fetcher.server.min.delay
  fetcher.max.crawl.delay

I'm guessing there are others, but those four looked most closely
related to how fast the fetcher is allowed to hit any one host, which
might explain why raising the thread count alone made no difference.
Spending a bit of time reading the descriptions in
conf/nutch-default.xml is very helpful for tracking these things down.
Override those values in conf/nutch-site.xml rather than changing
nutch-default.xml directly (at least that's what everything I've read
recommends).
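
As a rough sketch, an override in conf/nutch-site.xml might look
something like the fragment below. The property names are the four
listed above; every value is an illustrative assumption on my part,
not a recommendation, and the comments paraphrase how I read the
descriptions in nutch-default.xml, so check them against your own copy.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: values here override conf/nutch-default.xml. -->
<!-- All values below are illustrative assumptions, not tested settings. -->
<configuration>

  <!-- How many fetcher threads may access the same host at once. -->
  <property>
    <name>fetcher.threads.per.host</name>
    <value>10</value>
  </property>

  <!-- Seconds to wait between successive requests to the same server. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>

  <!-- Minimum per-server delay in seconds; as I read the description,
       it only matters when fetcher.threads.per.host is above 1. -->
  <property>
    <name>fetcher.server.min.delay</name>
    <value>0.0</value>
  </property>

  <!-- Largest robots.txt Crawl-Delay (in seconds) the fetcher will
       accept before it skips the page. -->
  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>30</value>
  </property>

</configuration>
```

Only loosen these for hosts you control or have an agreement with, as
Ken said, since they are what keeps the fetcher polite.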
Thanks,
Kirby