Roger Dunk wrote:
> Andrzej stated in NUTCH-669 that "some people reported performance
> issues with Fetcher2, i.e. that it doesn't use the available bandwidth.
> These reports are unconfirmed, and they may have been caused by
> suboptimal URL / host distribution in a fetchlist - but it would be good
> to review the synchronization and threading aspects of Fetcher2."
> To address this, I've tried just now generating a fetchlist using
> generate.max.per.host = 1 (which gave me 35,000 unique hosts) to
> guarantee unique hosts, but the problem still remains.
> Therefore, I believe it's clearly not an issue of suboptimal URL / host
> distribution. If you require any further information to confirm my
> report, you need only ask!

I have so far seen two sources of slowness; I don't know if they are
related to your case:
1. You are using Nutch from behind a NAT box. I experienced this problem
when I did some test crawling from a machine sitting behind an ADSL router
that did NAT. Soon after starting a crawl, the maximum number of NAT
connections was reached in the router, and further connections could only
be made after old ones timed out of the NAT table. These connections were
mostly DNS connections.
2. Your machine has IPv6 enabled. I noticed this more recently when I was
wondering about the relatively slow fetching speed on a box. After
disabling IPv6 completely, I was able to fetch 2-4 times faster without
any other config changes.
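
For the IPv6 case, a rough way to see what the JVM resolves for a host, and how long the lookup takes, is sketched below. It is only an illustration and not part of Nutch: the class name Ipv6Check and the helper countIpv6 are made up for this example. java.net.preferIPv4Stack is a standard JVM networking property, but setting it only affects the JVM; it is not the same as disabling IPv6 on the whole machine, which is what the 2-4 times figure above refers to.

```java
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper class, not part of Nutch.
public class Ipv6Check {

    // Counts how many of the given addresses are IPv6 addresses.
    static int countIpv6(InetAddress[] addrs) {
        int n = 0;
        for (InetAddress a : addrs) {
            if (a instanceof Inet6Address) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) throws UnknownHostException {
        // Host to look up; falls back to localhost when no argument is given.
        String host = args.length > 0 ? args[0] : "localhost";

        // Time a single name lookup (this includes any resolver caching effects).
        long start = System.nanoTime();
        InetAddress[] addrs = InetAddress.getAllByName(host);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println(host + ": " + addrs.length + " address(es), "
                + countIpv6(addrs) + " IPv6, resolved in " + elapsedMs + " ms");
        // Prints null when the property was not set on the command line.
        System.out.println("java.net.preferIPv4Stack="
                + System.getProperty("java.net.preferIPv4Stack"));
    }
}
```

Running it once as "java Ipv6Check some.host" and once as "java -Djava.net.preferIPv4Stack=true Ipv6Check some.host" (some.host is a placeholder) gives a quick comparison of what is resolved and how long it takes in each mode.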
--
Sami Siren