Some suggestions to improve performance:
1. Decrease the randomization of the FetchList. Here is the comment from FetchListTool:

/**
 * The TableSet class will allocate a given FetchListEntry
 * into one of several ArrayFiles. It chooses which
 * ArrayFile based on a hash of the URL's domain name.
 *
 * It uses a hash of the domain name so that pages are
 * allocated to a random ArrayFile, but same-host pages
 * go to the same file (for efficiency purposes during
 * fetch).
 *
 * Further, within a given file, the FetchListEntry items
 * appear in random order. This is so that we don't
 * hammer the same site over and over again during fetch.
 *
 * Each table should receive a roughly
 * even number of entries, but all URLs for a specific
 * domain name will be found in a single table. If
 * the dataset is weirdly skewed toward large domains,
 * there may be an uneven distribution.
 */

Note the "same-host pages go to the same file" part: such pages should also appear in sequence, without being mixed/randomized with pages from other hosts. Today we fetch a single URL and then forget that this TCP/IP connection even exists; we also forget that the web server created a client process to handle our HTTP requests. This is exactly what Keep-Alive addresses. Creating a TCP connection, and additionally such a client process on the web server, costs a lot of CPU on both sides, Nutch and the web server. I suggest using a single Keep-Alive thread to fetch all pages of a single host, without randomization (see the sketch at the end of this message).

2. Use/investigate more of the Socket API, such as:

public void setSoTimeout(int timeout)
public void setReuseAddress(boolean on)  - i.e., call setReuseAddress(true)

I found this in the J2SE API for setReuseAddress (the default is false):
=====
When a TCP connection is closed the connection may remain in a timeout state for a period of time after the connection is closed (typically known as the TIME_WAIT state or 2MSL wait state). For applications using a well known socket address or port it may not be possible to bind a socket to the required SocketAddress if there is a connection in the timeout state involving the socket address or port.
=====

This probably means that we pile up a huge number (65000!) of "waiting" TCP ports after Socket.close(), and the fetcher threads are then blocked by the OS until it releases some of these ports... Am I right?

P.S. Anyway, using the Keep-Alive option is very important not only for us but also for production web sites.
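To illustrate suggestion 1, here is a rough, untested sketch: one thread, one socket per host, many pages fetched over the same keep-alive connection, using the setSoTimeout()/setReuseAddress() calls from suggestion 2. The class and method names (KeepAliveFetcher, fetchAll) are mine, not Nutch code, and it deliberately assumes responses carry a Content-Length header; real code must also handle chunked encoding, "Connection: close" replies, redirects, and politeness rules.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

public class KeepAliveFetcher {

  public static void fetchAll(String host, String[] paths) throws IOException {
    Socket socket = new Socket();
    socket.setReuseAddress(true);            // allow rebinding a local port in TIME_WAIT
    socket.connect(new InetSocketAddress(host, 80), 10000);
    socket.setSoTimeout(10000);              // don't block forever on a dead server
    try {
      OutputStream out = socket.getOutputStream();
      // ISO-8859-1 maps one byte to one char, so Content-Length (bytes)
      // can be counted in chars read from this Reader.
      BufferedReader in = new BufferedReader(
          new InputStreamReader(socket.getInputStream(), "ISO-8859-1"));
      for (int i = 0; i < paths.length; i++) {
        out.write(("GET " + paths[i] + " HTTP/1.1\r\n"
                 + "Host: " + host + "\r\n"
                 + "Connection: keep-alive\r\n\r\n").getBytes("ISO-8859-1"));
        out.flush();
        // Read the status line and headers; remember Content-Length.
        int contentLength = 0;
        String line;
        while ((line = in.readLine()) != null && line.length() > 0) {
          if (line.toLowerCase().startsWith("content-length:")) {
            contentLength = Integer.parseInt(line.substring(15).trim());
          }
        }
        // Read exactly contentLength bytes of body; afterwards the same
        // socket is ready for the next request, with no new TCP handshake.
        char[] body = new char[contentLength];
        int read = 0;
        while (read < contentLength) {
          int n = in.read(body, read, contentLength - read);
          if (n < 0) break;
          read += n;
        }
        System.out.println(paths[i] + ": " + read + " bytes");
      }
    } finally {
      socket.close();                        // one close per host, not one per page
    }
  }
}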
Thanks,
Fuad

-----Original Message-----
From: Fuad Efendi [mailto:[EMAIL PROTECTED]
Sent: Friday, September 30, 2005 10:58 PM
To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: RE: what contibute to fetch slowing down

Dear Nutchers,

I noticed the same problem twice: with a Pentium Mobile 2GHz & Windows XP & 2GB, and with 2 x Opteron 252 & SUSE Linux & 4GB.

I have only one explanation, which should probably be mirrored at JIRA:

================
Network.
========

1. I never had such a problem with The Grinder, http://grinder.sourceforge.net, which is based on the alternative HTTPClient, http://www.innovation.ch/java/HTTPClient/index.html. Apache SF should really review their HttpClient RC3(!!!) accordingly; HTTPClient (upper-HTTP-case) is not "alpha", it is a production version... I used Grinder a lot; it can execute 32 processes with 64 threads each in 2048MB of RAM...

2. I found this in the Sun API: java.net.Socket, public void setReuseAddress(boolean on) - please check the API!!!

3. I saw this code in your protocol-http plugin:
...
HTTP/1.0
...
Why? Why version 1.0??? It should understand the server's reply, such as
"Connection: close"
"Connection: keep-alive"
etc. (see the HttpURLConnection sketch at the end of this thread).

4. By the way, how many file descriptors does UNIX need in order to maintain 65536 network sockets?

Respectfully,
Fuad

P.S. Sorry guys, I don't have enough time to participate... Could you please test this suspicious behaviour and this very strange opinion? Should I create a new bug report at JIRA? Sun's Socket, Apache's HttpClient, UNIX networking...

-----Original Message-----
From: Daniele Menozzi [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 28, 2005 4:42 PM
To: nutch-dev@lucene.apache.org
Subject: Re: what contibute to fetch slowing down

On 10:27:55 28/Sep , AJ Chen wrote:
> I started the crawler with about 2000 sites. The fetcher could achieve
> 7 pages/sec initially, but the performance gradually dropped to about 2
> pages/sec, sometimes even 0.5 pages/sec. The fetch list had 300k pages
> and I used 500 threads. What are the main causes of this slowing down?

I have the same problem; I've tried with different numbers of fetchers (10, 20, 50, 100, ...), but the download rate always decreases systematically, page after page. The machine is a P4 1.7GHz, 768 MB RAM, running Debian on a 2.6.12 kernel. Bandwidth isn't a problem (10Mbit in and 10Mbit out), but I cannot obtain a stable and high pages/sec rate. I've also tried changing the machine and the kernel, but the problem remains. Can you please give us some advice?

Thank you for your help,
	Menoz

--
Free Software Enthusiast
Debian Powered Linux User #332564
http://menoz.homelinux.org
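For point 3 above, a minimal sketch of the HTTP/1.1 alternative: Sun's java.net.HttpURLConnection already speaks HTTP/1.1 and transparently pools keep-alive sockets per host (tunable through the documented http.keepAlive and http.maxConnections system properties), so protocol-http could get connection reuse without hand-rolled socket code. The URLs, class name, and property values here are illustrative assumptions, not Nutch code.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class KeepAliveViaUrlConnection {
  public static void main(String[] args) throws Exception {
    // JVM-wide knobs for the built-in HTTP/1.1 keep-alive pool.
    System.setProperty("http.keepAlive", "true");     // this is already the default
    System.setProperty("http.maxConnections", "10");  // idle sockets kept per host

    String[] urls = {                                 // illustrative URLs only
        "http://www.example.com/a.html",
        "http://www.example.com/b.html" };

    for (int i = 0; i < urls.length; i++) {
      HttpURLConnection conn =
          (HttpURLConnection) new URL(urls[i]).openConnection();
      conn.setConnectTimeout(10000);  // Java 5+; older JDKs use the
      conn.setReadTimeout(10000);     // sun.net.client.default*Timeout properties
      int status = conn.getResponseCode();
      InputStream in = conn.getInputStream();
      byte[] buf = new byte[4096];
      int total = 0, n;
      while ((n = in.read(buf)) != -1) {
        total += n;
      }
      // Fully draining and closing the stream hands the socket back to
      // the keep-alive pool, so the next URL on this host reuses it.
      in.close();
      System.out.println(urls[i] + ": HTTP " + status + ", " + total + " bytes");
    }
  }
}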