I'm also getting very slow crawl rates of around 6 pages/sec. I haven't been
able to analyze this issue at length yet, for instance by using ntop to see if
my network connection is pegged. However, I did get a slightly better result
(about 20%-30% better) by following Sami Siren's suggestion:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg06533.html
I've also tried fetch2, which I think is faster, though again I'm not seeing a
radical improvement.
--Kai
----- Original Message ----
From: Audrey Liu <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Friday, July 20, 2007 1:56:52 PM
Subject: tweaking config files for better performance
Hi,
I am using Nutch 0.9 to crawl our intranet site (~60,000 pages, of which
~28,000 are HTML). I've seen other posts where people mention getting their
crawler to do 20 pages/sec, but the best I've seen so far is only 8
pages/sec.
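For a sense of scale, the rates mentioned in this thread can be turned into
rough wall-clock times for ~60,000 pages. This is only back-of-the-envelope
arithmetic in Python: it assumes a constant fetch rate and ignores the
generate, update and index steps, and the function name is made up for
illustration.

```python
def crawl_minutes(pages: int, pages_per_sec: float) -> float:
    """Rough fetch time in minutes, assuming a constant rate (an idealization)."""
    if pages_per_sec <= 0:
        raise ValueError("pages_per_sec must be positive")
    return pages / pages_per_sec / 60.0


if __name__ == "__main__":
    pages = 60_000  # approximate page count from the message
    for rate in (6, 8, 20):  # pages/sec figures quoted in this thread
        print(f"{rate:>2} pages/sec -> about {crawl_minutes(pages, rate):.0f} minutes")
```

Under these assumptions the fetch takes about 125 minutes at 8 pages/sec and
about 50 minutes at 20 pages/sec.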
I've also read that fetcher threads tend to block when they try to fetch
pages from the same host. So I'm wondering which settings I should change
to get the best performance. My current configuration in nutch-site.xml is
as follows:
<property>
  <name>fetcher.threads.fetch</name>
  <value>200</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>50</value>
</property>
<property>
  <name>http.max.delays</name>
  <value>1</value>
</property>
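
To illustrate why the per-host limit can matter more than the total thread
count when crawling a single site, here is a small Python sketch. It assumes
the properties above sit inside a <configuration> root element, and it uses a
deliberately simplified model (busy threads = min(total, hosts x per-host
limit)) that ignores politeness delays and server response time. The function
names are hypothetical and are not part of Nutch.

```python
import xml.etree.ElementTree as ET

# The three properties from this message, wrapped in a <configuration> root
# (assumed here so the snippet parses as a standalone document).
NUTCH_SITE = """
<configuration>
  <property><name>fetcher.threads.fetch</name><value>200</value></property>
  <property><name>fetcher.threads.per.host</name><value>50</value></property>
  <property><name>http.max.delays</name><value>1</value></property>
</configuration>
"""


def read_properties(xml_text: str) -> dict:
    """Return {name: value} for every <property> element."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def max_busy_threads(props: dict, distinct_hosts: int) -> int:
    """Simplified upper bound on threads that can fetch at the same time."""
    total = int(props["fetcher.threads.fetch"])
    per_host = int(props["fetcher.threads.per.host"])
    return min(total, distinct_hosts * per_host)


if __name__ == "__main__":
    props = read_properties(NUTCH_SITE)
    for hosts in (1, 2, 4, 10):
        print(f"{hosts} host(s): at most {max_busy_threads(props, hosts)} busy threads")
```

Under this model, a crawl of a single intranet host can keep at most 50 of
the 200 configured threads busy at once.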
Any pointers are greatly appreciated!! Thanks in advance.
AL
--
View this message in context:
http://www.nabble.com/tweaking-config-files-for-better-performance-tf4119552.html#a11715927
Sent from the Nutch - User mailing list archive at Nabble.com.
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general