Some suggestions to improve performance:

1. Decrease randomization of the FetchList.
Here is the comment from FetchListTool:
     * The TableSet class will allocate a given FetchListEntry
     * into one of several ArrayFiles.  It chooses which
     * ArrayFile based on a hash of the URL's domain name.
     * It uses a hash of the domain name so that pages are
     * allocated to a random ArrayFile, but same-host pages
     * go to the same file (for efficiency purposes during
     * fetch).
     * Further, within a given file, the FetchListEntry items
     * appear in random order.  This is so that we don't
     * hammer the same site over and over again during fetch.
     * Each table should receive a roughly
     * even number of entries, but all URLs for a  specific 
     * domain name will be found in a single table.  If
     * the dataset is weirdly skewed toward large domains,
     * there may be an uneven distribution.

Note "same-host pages go to the same file": those pages should also be
fetched in sequence, without being mixed/randomized with pages from
other hosts...
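To illustrate the ordering I mean, here is a minimal sketch. HostGrouper and groupByHost are hypothetical names, not part of Nutch or FetchListTool; the sketch only shows a fetch list being reordered so that URLs of one host sit next to each other, and it assumes every entry is a syntactically valid URL.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not Nutch code): reorders a fetch list so that
// all URLs of one host are adjacent.
final class HostGrouper {

    // Hosts appear in order of first occurrence; URLs within a host keep
    // their original relative order. URI.create throws
    // IllegalArgumentException for a malformed URL.
    static List<String> groupByHost(List<String> urls) {
        Map<String, List<String>> byHost = new LinkedHashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            // URLs without a parsable host share one bucket.
            String key = (host == null) ? "" : host.toLowerCase();
            byHost.computeIfAbsent(key, k -> new ArrayList<>()).add(url);
        }
        List<String> ordered = new ArrayList<>();
        for (List<String> group : byHost.values()) {
            ordered.addAll(group);
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://a.example/1", "http://b.example/1",
                "http://a.example/2", "http://b.example/2");
        for (String url : groupByHost(urls)) {
            System.out.println(url);
        }
    }
}
```

A fetcher thread that walks such a list could keep one connection open per host group. Politeness delays between requests to the same host would still be needed, and the sketch does not handle them.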

We fetch a single URL, and then we forget about the existence of this
TCP/IP connection; we even forget that the Web Server created a Client
Process to handle our HTTP requests. Reusing both is what Keep-Alive is
for. Creating a TCP connection, and additionally creating such a Client
Process on the Web Server, costs a lot of CPU on both sides, Nutch & Web
Server.

I suggest using a single Keep-Alive thread to fetch a single Host, without

2. Use/investigate more stuff from the Socket API, such as
public void setSoTimeout(int timeout)
public void setReuseAddress(boolean on)
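A minimal sketch of how these two calls could be applied to a client socket before it connects. SocketOptionsSketch, configuredSocket and describe are hypothetical names for illustration; only the java.net.Socket methods are real API. Whether SO_REUSEADDR actually helps the fetcher is an assumption that still needs testing.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.Socket;

// Hypothetical helper (not Nutch code): applies both socket options to
// an unconnected socket.
final class SocketOptionsSketch {

    static Socket configuredSocket(int readTimeoutMillis) {
        Socket socket = new Socket(); // not yet bound or connected
        try {
            // Set before bind/connect; the behaviour of changing it on an
            // already bound socket is not defined by the API.
            socket.setReuseAddress(true);
            // A blocking read() fails with SocketTimeoutException after
            // this many milliseconds instead of hanging forever.
            socket.setSoTimeout(readTimeoutMillis);
            return socket;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the options back so the effective values can be logged.
    static String describe(Socket socket) {
        try {
            return "reuseAddress=" + socket.getReuseAddress()
                    + ", soTimeout=" + socket.getSoTimeout();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        try (Socket socket = configuredSocket(10_000)) {
            System.out.println(describe(socket));
        }
    }
}
```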

I found this in the J2SE API for setReuseAddress (default: false):
When a TCP connection is closed the connection may remain in a timeout
state for a period of time after the connection is closed (typically
known as the TIME_WAIT state or 2MSL wait state). For applications using
a well known socket address or port it may not be possible to bind a
socket to the required SocketAddress if there is a connection in the
timeout state involving the socket address or port. 

It probably means that we are reaching a huge number (65000!) of "waiting"
TCP ports after Socket.close(), and the Fetcher threads are blocked by the
OS, waiting for it to release some of these ports... Am I right?

Anyway, using the Keep-Alive option is very important not only for us but
also for production Web sites.


-----Original Message-----
From: Fuad Efendi [mailto:[EMAIL PROTECTED] 
Sent: Friday, September 30, 2005 10:58 PM
Subject: RE: what contibute to fetch slowing down

Dear Nutchers,

I noticed the same problem twice, with PentiumMobile2Mhz & WindowsXP & 2Gb,
and with 2xOpteron252 x SuseLinux x 4Gb

I have only one explanation, which should probably be mirrored at JIRA:


I never had such a problem with The Grinder, which is based on an
alternate HTTPClient. Apache SF should really review their HttpClient
RC3(!!!) accordingly; HTTPClient (upper-case HTTP) is not "alpha", it is
a production version... I used Grinder a lot; it allows executing 32
processes with 64 threads each on 2048Mb RAM...

I found this in the SUN API:
public void setReuseAddress(boolean on) - please check API!!!

I saw this code in your PROTOCOL-HTTP:
... HTTP/1.0 ...
Why? Why version 1.0??? It should understand server replies such as
"Connection: close", "Connection: keep-alive", etc. (pls ignore typo).
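A minimal sketch of the decision a client has to make after reading the response headers. KeepAlivePolicy and isPersistent are hypothetical names, not code from PROTOCOL-HTTP. The rule it encodes is the standard one: HTTP/1.1 connections are persistent unless the server says "Connection: close", and HTTP/1.0 connections are closed unless the server says "Connection: keep-alive".

```java
// Hypothetical helper (not Nutch code): decides whether a connection
// may be reused after a response.
final class KeepAlivePolicy {

    // httpVersion is the version from the status line, e.g. "HTTP/1.1".
    // connectionHeader is the value of the Connection header, or null
    // if the server did not send one.
    static boolean isPersistent(String httpVersion, String connectionHeader) {
        boolean sawClose = false;
        boolean sawKeepAlive = false;
        if (connectionHeader != null) {
            // The header may carry several comma-separated tokens.
            for (String token : connectionHeader.split(",")) {
                String t = token.trim();
                if (t.equalsIgnoreCase("close")) {
                    sawClose = true;
                } else if (t.equalsIgnoreCase("keep-alive")) {
                    sawKeepAlive = true;
                }
            }
        }
        if (sawClose) {
            return false; // an explicit close always wins
        }
        if (sawKeepAlive) {
            return true;
        }
        // No relevant token: fall back to the protocol default.
        return "HTTP/1.1".equalsIgnoreCase(httpVersion.trim());
    }

    public static void main(String[] args) {
        System.out.println(isPersistent("HTTP/1.1", null));
        System.out.println(isPersistent("HTTP/1.1", "close"));
        System.out.println(isPersistent("HTTP/1.0", "Keep-Alive"));
    }
}
```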

By the way, how many files does UNIX need in order to maintain 65536 network


Sorry guys, I don't have enough time to participate... Could you please
test this suspicious behaviour and my very strange opinion? Should I
create a new bug report at JIRA?

SUN's Socket, Apache's HttpClient, UNIX's networking...

-----Original Message-----
From: Daniele Menozzi [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 28, 2005 4:42 PM
Subject: Re: what contibute to fetch slowing down

On  10:27:55 28/Sep , AJ Chen wrote:
> I started the crawler with about 2000 sites.  The fetcher could
> achieve
> 7 pages/sec initially, but the performance gradually dropped to about
> pages/sec, sometimes even 0.5 pages/sec.  The fetch list had 300k
> and I used 500 threads. What are the main causes of this slowing down?

I have the same problem; I've tried different numbers of fetchers
(10, 20, 50, 100, ...), but the download rate always decreases
systematically, page after page. The machine is a P4 1.7, 768 MB RAM,
running Debian on a 2.6.12 kernel. Bandwidth isn't a problem (10Mbit in
and 10Mbit out), but I cannot obtain a stable, high pages/s rate. I've
also tried changing the machine and kernel, but the problem remains. Can
you please give us some advice? Thank you for your help,

                      Free Software Enthusiast
                 Debian Powered Linux User #332564 
