Some suggestions to improve performance:

1. Decrease the randomization of the FetchList.
 
Here is the comment from FetchListTool:
   /**
    * The TableSet class will allocate a given FetchListEntry
    * into one of several ArrayFiles.  It chooses which
    * ArrayFile based on a hash of the URL's domain name.
    *
    * It uses a hash of the domain name so that pages are
    * allocated to a random ArrayFile, but same-host pages
    * go to the same file (for efficiency purposes during
    * fetch).
    *
    * Further, within a given file, the FetchListEntry items
    * appear in random order.  This is so that we don't
    * hammer the same site over and over again during fetch.
    *
    * Each table should receive a roughly even number of
    * entries, but all URLs for a specific domain name will
    * be found in a single table.  If the dataset is weirdly
    * skewed toward large domains, there may be an uneven
    * distribution.
    */
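
For reference, a minimal sketch of the allocation scheme this comment
describes (chooseFile, domain, and numFiles are placeholder names, not
the actual Nutch code):

=====
// Sketch of hash-based allocation: every URL of one domain maps to the
// same file index, so same-host pages stay together.
static int chooseFile(String domain, int numFiles) {
    // Mask off the sign bit so the index is never negative.
    return (domain.hashCode() & Integer.MAX_VALUE) % numFiles;
}
=====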

Same "same-host pages go to the same file" - they should go in a
sequence, without mixing/randomizing with other host-pages...

We fetch a single URL and then forget that this TCP/IP connection even
exists; we also forget that the web server created a client process to
handle our HTTP requests. Reusing that connection for further requests
is what HTTP Keep-Alive is for. Creating a TCP connection, and
additionally creating such a client process on the web server, costs a
lot of CPU on both sides, Nutch and the web server.

I suggest using a single keep-alive thread to fetch each host, without
randomization.
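
A minimal sketch of the idea, assuming one worker thread per host (the
HostFetcher class below is hypothetical, not existing Nutch code). As
far as I know, the JDK's HttpURLConnection already pools keep-alive
connections per host, so fetching a host's URLs sequentially on one
thread reuses the same TCP connection, as long as each response body is
fully drained:

=====
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;

// Hypothetical sketch: fetch every URL of ONE host on a single thread.
public class HostFetcher implements Runnable {

    private final List<URL> urls;  // all URLs here belong to the same host

    public HostFetcher(List<URL> urls) {
        this.urls = urls;
    }

    public void run() {
        byte[] buf = new byte[8192];
        for (URL url : urls) {
            try {
                HttpURLConnection conn =
                    (HttpURLConnection) url.openConnection();
                conn.setConnectTimeout(10000);
                conn.setReadTimeout(10000);
                InputStream in = conn.getInputStream();
                try {
                    // Drain the body completely so the connection goes
                    // back into the JDK's keep-alive pool for the next URL.
                    while (in.read(buf) != -1) { }
                } finally {
                    in.close();
                }
                // Do NOT call conn.disconnect(): that closes the socket
                // and defeats keep-alive.
            } catch (Exception e) {
                // Log and continue with the next URL of this host.
            }
        }
    }
}
=====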


2. Use/investigate more of the Socket API, such as:
public void setSoTimeout(int timeout)
public void setReuseAddress(boolean on)

I found this in the J2SE API docs for setReuseAddress (default: false):
=====
When a TCP connection is closed the connection may remain in a timeout
state for a period of time after the connection is closed (typically
known as the TIME_WAIT state or 2MSL wait state). For applications using
a well known socket address or port it may not be possible to bind a
socket to the required SocketAddress if there is a connection in the
timeout state involving the socket address or port. 
=====

It probably means that after Socket.close() we accumulate a huge number
(up to ~65,000, the whole port range!) of TCP ports stuck in TIME_WAIT,
and the fetcher threads are blocked by the OS until some of those ports
are released... Am I right?
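
For what it's worth, a minimal sketch of setting both options (the host
name and the 10-second timeouts are placeholders). Note that
SO_REUSEADDR must be set before the socket is bound or connected, and
that it mainly helps when re-binding a specific local address; it does
not empty the TIME_WAIT table by itself:

=====
import java.net.InetSocketAddress;
import java.net.Socket;

public class SocketOptionsDemo {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket();      // unconnected socket
        socket.setReuseAddress(true);      // must precede bind/connect
        socket.connect(new InetSocketAddress("example.com", 80), 10000);
        socket.setSoTimeout(10000);        // a read now fails after 10s
                                           // instead of blocking a
                                           // fetcher thread forever
        // ... send request, read response ...
        socket.close();
    }
}
=====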


P.S.
Anyway, using Keep-Alive is very important not only for us but also for
the production web sites we fetch from.

Thanks,
Fuad





-----Original Message-----
From: Fuad Efendi [mailto:[EMAIL PROTECTED] 
Sent: Friday, September 30, 2005 10:58 PM
To: nutch-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: RE: what contibute to fetch slowing down


Dear Nutchers,


I noticed the same problem twice: with a Pentium M 2 GHz, Windows XP,
and 2 GB RAM, and with a 2x Opteron 252, SUSE Linux, and 4 GB RAM.

I have only one explanation, which should probably be mirrored in JIRA:


================
Network.
========


1.
I never had such a problem with The Grinder,
http://grinder.sourceforge.net, which is based on the alternative
HTTPClient library, http://www.innovation.ch/java/HTTPClient/index.html.
Apache SF should really review their HttpClient RC3(!!!) accordingly:
HTTPClient (the upper-case-HTTP one) is not "alpha", it is a production
version... I used Grinder a lot; it can execute 32 processes with 64
threads each on 2048 MB of RAM...


2.
I found this in the Sun API:
java.net.Socket
public void setReuseAddress(boolean on) - please check the API!!!


3.
I saw this code in your protocol-http plugin:
... HTTP/1.0 ...
Why? Why version 1.0??? The fetcher should speak HTTP/1.1 and
understand server replies such as "Connection: close" and
"Connection: keep-alive".


4.
By the way, how many file descriptors does UNIX need in order to
maintain 65536 network sockets?


Respectfully,
Fuad

P.S.
Sorry guys, I don't have enough time to participate... Could you please
test this suspicious behaviour and this very strange opinion? Should I
create a new bug report in JIRA?

SUN's Socket, Apache's HttpClient, UNIX's networking...




-----Original Message-----
From: Daniele Menozzi [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 28, 2005 4:42 PM
To: nutch-dev@lucene.apache.org
Subject: Re: what contibute to fetch slowing down


On  10:27:55 28/Sep , AJ Chen wrote:
> I started the crawler with about 2000 sites.  The fetcher could
> achieve 7 pages/sec initially, but the performance gradually dropped
> to about 2 pages/sec, sometimes even 0.5 pages/sec.  The fetch list
> had 300k pages and I used 500 threads. What are the main causes of
> this slowing down?


I have the same problem; I've tried with different numbers of fetchers
(10, 20, 50, 100, ...), but the download rate always decreases
systematically, page after page. The machine is a P4 1.7, 768 MB RAM,
running Debian on a 2.6.12 kernel. Bandwidth isn't a problem (10 Mbit
in and 10 Mbit out), but I cannot obtain a stable and high pages/sec
rate. I've also tried changing machine and kernel, but the problem
remains. Can you please give us some advice? Thank you for your help,
        Menoz



-- 
                      Free Software Enthusiast
                 Debian Powered Linux User #332564 
                     http://menoz.homelinux.org



