Hi,

I find it a bit hard to follow your various ideas here, but I'll add my
comments to some parts below.

--- "Fuad Efendi (JIRA)" <[EMAIL PROTECTED]> wrote:

>     [ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331892 ]
> 
> Fuad Efendi commented on NUTCH-109:
> -----------------------------------
> 
> This method:
>   private static InetAddress blockAddr(URL url) throws ProtocolException {...}

Where is this method?

> I checked it in both classes:
>   org.apache.nutch.protocol.http.Http
>   org.apache.nutch.protocol.httpclient.Http
> 
> Default settings (nutch-default.xml):
>   fetcher.server.delay=5.0 (seconds)
>   fetcher.threads.per.host=1
> 
> blockAddr(...) method blocks Internet Address for
> fetcher.server.delay amount of time, it "blocks" this address for all
> threads except current thread. Rest of threads are in Sleep() state;
> amount of sleeping threads is limited by
>   fetcher.threads.per.host

That doesn't sound right.  fetcher.threads.per.host has nothing to do
with sleeping threads or sleep time.  It specifies the number of threads
that are allowed to hit the same host at the same time.  In other words,
it lets you control the degree of parallelization per host.  That is the
equivalent of those "3 TCP connections" you were mentioning yesterday.

fetcher.server.delay is what specifies "sleep between requests" time.
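
To illustrate the difference between the two settings, here is a minimal
sketch.  It is not Nutch's actual code: the class HostThrottle and its
method names are made up for this example, and the delay is taken in
milliseconds rather than seconds.  The semaphore plays the role of
fetcher.threads.per.host and the timestamp plays the role of
fetcher.server.delay.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical helper, not part of Nutch: shows how the two settings differ.
class HostThrottle {
    private final int threadsPerHost;  // role of fetcher.threads.per.host
    private final long delayMillis;    // role of fetcher.server.delay, in ms
    private final ConcurrentHashMap<String, Semaphore> slots = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, Long> nextAllowed = new ConcurrentHashMap<>();

    public HostThrottle(int threadsPerHost, long delayMillis) {
        this.threadsPerHost = threadsPerHost;
        this.delayMillis = delayMillis;
    }

    // Blocks while threadsPerHost threads are already using this host,
    // then sleeps until the delay since the last finished request has passed.
    public void acquire(String host) throws InterruptedException {
        Semaphore slot = slots.computeIfAbsent(host, h -> new Semaphore(threadsPerHost));
        slot.acquire();
        long wait = nextAllowed.getOrDefault(host, 0L) - System.currentTimeMillis();
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }

    // Records when the host may be hit again and frees the slot.
    public void release(String host) {
        nextAllowed.put(host, System.currentTimeMillis() + delayMillis);
        Semaphore slot = slots.get(host);
        if (slot != null) {
            slot.release();
        }
    }

    // Number of currently unused per-host slots.
    public int freeSlots(String host) {
        Semaphore slot = slots.get(host);
        return slot == null ? threadsPerHost : slot.availablePermits();
    }
}
```

A fetcher thread would call acquire(host) before a request and
release(host) after it.  Raising threadsPerHost allows more parallel
requests to one host; lowering delayMillis shortens the pause between
them.  The two knobs are independent.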

> So, playing with this parameters we can probably improve performance;
> I'm going to perform new performance tests.
> 
> New plugin does not use this:
>   http.timeout=10000
>   http.content.limit=65536

This will affect your benchmark, though I don't know by how much.

> Keep-Alive timeout is very important; default "Keep-Alive" timeout of
> a new plugin is 60 seconds (it automatically closes HTTP after 60
> seconds).
> 
> 1. we are establishing TCP transport, 100-300 milliseconds X 2-3
> times (TCP HandShake? some IP packets...)
> 2. Apache HTTPD Server creates Client thread to handle our requests,
> 1 second (more or less, try Internet Explorer, first page takes few
> second to download, then browsing works very fast - we have personal
> Thread on the Server).

This is often due to the initial hostname lookup, when the domain name
server doesn't already have the host's IP address cached.
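
If you want to check this on your setup, a rough sketch like the one
below times a single lookup.  LookupTimer is a made-up name, not
something in Nutch.  Keep in mind that the JVM also caches successful
lookups itself, so a repeated call for the same host in the same
process will normally return much faster than the first one.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for measuring how long a hostname lookup takes.
class LookupTimer {
    // Resolves the host name and returns the elapsed time in milliseconds.
    public static long timeLookupMillis(String host) throws UnknownHostException {
        long start = System.nanoTime();
        InetAddress.getByName(host);
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Calling timeLookupMillis() for a host before the first fetch, and again
afterwards, should give a feel for how much of that first-page delay is
name resolution rather than anything the web server does.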

> 3. Line 135, HttpResponse.java:
>      get.releaseConnection();
> 
> Unfortunately we won't use HTTP/1.1 even if I modify some parameters
> such as
>    HttpVersion.HTTP_1_0 (protocol-httpclient/HttpResponse.java:92)
> - we close connection at the end...

Have you seen Kelvin Tan's patch?
You should take a look, it's in JIRA, and addresses some of the
HTTP/1.1 issues that you are concerned about.

> We have network equipment limitations too, we can't reach more than
> 65000 threads over single LAN card, and JVM is good (but better is to
> have multiple JVM/processes, 100 threads each...) 

65000 threads?  What are you trying to fetch?  The whole web?


Otis


> We can load network segment for only 30% due to those HandShakes and
> delays...
> 
> Compare with any free available Web-Grabber tool, even IE/Netscape,
> downloading single big file can use 99% of network capacity,
> downloading multiple HTML - only 20-30% (I saw it in Teleport Pro
> during downloads from multiple linked to Apache sites, 10 threads)
> 
> Apache's MultiThreadedExample.java  uses single instance of
> HttpClient for multiple threads,
> http://svn.apache.org/viewcvs.cgi/jakarta/commons/proper/httpclient/trunk/src/examples/MultiThreadedExample.java?view=markup
> 
>         // Create an HttpClient with the MultiThreadedHttpConnectionManager.
>         // This connection manager must be used if more than one thread will
>         // be using the HttpClient.
>         HttpClient httpClient = new HttpClient(new MultiThreadedHttpConnectionManager());
>         // Set the default host/protocol for the methods to connect to.
>         // This value will only be used if the methods are not given an absolute URI
>         httpClient.getHostConfiguration().setHost("jakarta.apache.org", 80, "http");
> 
> 
> Same was done in a new plugin, with a basic very small code.
> 
> I am going to perform new tests; any suggestions are highly
> welcomed... 
> it will take few days (10 hours per test)
> 
> 
> > Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> > -----------------------------------------------------------------------
> >
> >          Key: NUTCH-109
> >          URL: http://issues.apache.org/jira/browse/NUTCH-109
> >      Project: Nutch
> >         Type: Improvement
> >   Components: fetcher
> >     Versions: 0.7, 0.8-dev, 0.6, 0.7.1
> >  Environment: Nutch: Windows XP, J2SE 1.4.2_09
> > Web Server: Suse Linux, Apache HTTPD, apache2-worker,  v. 2.0.53
> >     Reporter: Fuad Efendi
> >  Attachments: protocol-httpclient-innovation-0.1.0.zip
> >
> > 1. TCP connection costs a lot, not only for Nutch and end-point web
> > servers, but also for intermediary network equipment
> > 2. Web Server creates Client thread and hopes that Nutch really
> > uses HTTP/1.1, or at least Nutch sends "Connection: close" before
> > closing in JVM "Socket.close()" ...
> > I need to perform very objective tests, probably 2-3 days; new
> > plugin crawled/parsed 23,000 pages for 1,321 seconds; it seems that
> > existing http-plugin needs few days...
> > I am using separate network segment with Windows XP (Nutch), and
> > Suse Linux (Apache HTTPD + 120,000 pages)
> > Please find attached new plugin based on
> > http://www.innovation.ch/java/HTTPClient/
> > Please note:
> > Class HttpFactory contains cache of HTTPConnection objects; each
> > object run each thread; each object is absolutely thread-safe, so we
> > can send multiple GET requests using single instance:
> >    private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 3);
> > I'll add more comments after finishing tests...
> 



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
