[ 
http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331892 ] 

Fuad Efendi commented on NUTCH-109:
-----------------------------------

This method:
  private static InetAddress blockAddr(URL url) throws ProtocolException {...}

I checked it in both classes:
  org.apache.nutch.protocol.http.Http
  org.apache.nutch.protocol.httpclient.Http

Default settings (nutch-default.xml):
  fetcher.server.delay=5.0 (seconds)
  fetcher.threads.per.host=1

The blockAddr(...) method blocks an Internet address for fetcher.server.delay 
seconds; it "blocks" this address for all threads except the current one. The 
remaining threads sleep, and the number of sleeping threads is limited by
  fetcher.threads.per.host

So by tuning these parameters we can probably improve performance; I'm going 
to run new performance tests.
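
For reference, a minimal sketch of the blocking idea described above (this is 
not the actual Nutch source; the map-based bookkeeping and the 100 ms polling 
interval are my own simplification):

    import java.net.InetAddress;
    import java.util.HashMap;
    import java.util.Map;

    // One thread "claims" an address for serverDelay ms; the others sleep,
    // and at most maxThreadsPerHost threads may wait for the same address.
    public class AddressBlocker {
      private final long serverDelay;       // fetcher.server.delay, in ms
      private final int maxThreadsPerHost;  // fetcher.threads.per.host
      private final Map<InetAddress, Long> blockedUntil = new HashMap<InetAddress, Long>();
      private final Map<InetAddress, Integer> waiting = new HashMap<InetAddress, Integer>();

      public AddressBlocker(long serverDelay, int maxThreadsPerHost) {
        this.serverDelay = serverDelay;
        this.maxThreadsPerHost = maxThreadsPerHost;
      }

      public void blockAddr(InetAddress addr) throws Exception {
        synchronized (this) {
          int n = waiting.containsKey(addr) ? waiting.get(addr).intValue() : 0;
          if (n >= maxThreadsPerHost) {
            throw new Exception("too many threads waiting for " + addr);
          }
          waiting.put(addr, Integer.valueOf(n + 1));
        }
        try {
          while (true) {
            synchronized (this) {
              long now = System.currentTimeMillis();
              Long until = blockedUntil.get(addr);
              if (until == null || until.longValue() <= now) {
                blockedUntil.put(addr, Long.valueOf(now + serverDelay));
                return;                     // this thread now owns the address
              }
            }
            Thread.sleep(100);              // the remaining threads sleep here
          }
        } finally {
          synchronized (this) {
            waiting.put(addr, Integer.valueOf(waiting.get(addr).intValue() - 1));
          }
        }
      }
    }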

The new plugin does not use these settings:
  http.timeout=10000
  http.content.limit=65536

The Keep-Alive timeout is very important; the default "Keep-Alive" timeout of 
the new plugin is 60 seconds (it automatically closes the HTTP connection 
after 60 seconds).
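
Just to illustrate what those numbers correspond to on the commons-httpclient 
side (a sketch, not the plugin's code; the class name and the placement of 
closeIdleConnections() are my own illustration, only the 10,000 ms and 
60,000 ms values come from the settings above):

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;

    public class TimeoutConfig {
      public static HttpClient configure() {
        MultiThreadedHttpConnectionManager manager =
            new MultiThreadedHttpConnectionManager();
        manager.getParams().setConnectionTimeout(10000); // connect timeout (http.timeout)
        manager.getParams().setSoTimeout(10000);         // read timeout (http.timeout)

        HttpClient client = new HttpClient(manager);

        // Keep-Alive housekeeping: a fetcher would call this periodically
        // (e.g. from a timer thread) to drop connections idle for over 60 s.
        manager.closeIdleConnections(60 * 1000L);

        return client;
      }
    }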

1. We establish the TCP transport: 100-300 milliseconds x 2-3 round trips (the 
TCP handshake plus some IP packets...).
2. The Apache HTTPD server creates a client thread to handle our requests, 
about 1 second (more or less; try Internet Explorer - the first page takes a 
few seconds to download, then browsing is very fast, because we have a 
personal thread on the server).
3. Line 135, HttpResponse.java:
     get.releaseConnection();

Unfortunately we won't use HTTP/1.1 even if I modify parameters such as
   HttpVersion.HTTP_1_0 (protocol-httpclient/HttpResponse.java:92)
- we close the connection at the end...
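
To make the point concrete, here is a sketch (again, not the plugin's 
HttpResponse.java) of what reuse would look like with commons-httpclient 3.x: 
with a pooled connection manager, requesting HTTP/1.1 and calling 
releaseConnection() returns the socket to the pool instead of closing it, so 
the second GET to the same host can skip the TCP handshake:

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.HttpVersion;
    import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
    import org.apache.commons.httpclient.methods.GetMethod;

    public class ReuseSketch {
      public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient(new MultiThreadedHttpConnectionManager());
        client.getParams().setVersion(HttpVersion.HTTP_1_1);

        for (int i = 0; i < 2; i++) {
          GetMethod get = new GetMethod("http://jakarta.apache.org/");
          try {
            int status = client.executeMethod(get);
            byte[] body = get.getResponseBody();  // read fully before releasing
            System.out.println(status + ", " + body.length + " bytes");
          } finally {
            // returns the connection to the pool for reuse; it is not closed
            // unless the server or the connection manager decides to close it
            get.releaseConnection();
          }
        }
      }
    }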

We have network equipment limitations too: we can't go beyond about 65,000 
simultaneous connections (threads) over a single LAN card, and the JVM copes 
well (though it is better to have multiple JVMs/processes, 100 threads each...).

We can load the network segment to only about 30% because of those handshakes 
and delays...

Compare with any freely available web-grabber tool, or even IE/Netscape: 
downloading a single big file can use 99% of network capacity, while 
downloading many HTML pages uses only 20-30% (I saw this in Teleport Pro, 
downloading with 10 threads from multiple inter-linked Apache sites).

Apache's MultiThreadedExample.java uses a single instance of HttpClient for 
multiple threads:
http://svn.apache.org/viewcvs.cgi/jakarta/commons/proper/httpclient/trunk/src/examples/MultiThreadedExample.java?view=markup

        // Create an HttpClient with the MultiThreadedHttpConnectionManager.
        // This connection manager must be used if more than one thread will
        // be using the HttpClient.
        HttpClient httpClient = new HttpClient(new MultiThreadedHttpConnectionManager());

        // Set the default host/protocol for the methods to connect to.
        // This value will only be used if the methods are not given an absolute URI.
        httpClient.getHostConfiguration().setHost("jakarta.apache.org", 80, "http");


The same is done in the new plugin, with very little code.
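
For completeness, a minimal sketch of how several threads would use such a 
shared client (the URLs, class name and thread layout are made up for 
illustration; only the HttpClient and its connection manager are shared, each 
request gets its own GetMethod):

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
    import org.apache.commons.httpclient.methods.GetMethod;

    public class SharedClientSketch {
      public static void main(String[] args) {
        final HttpClient client =
            new HttpClient(new MultiThreadedHttpConnectionManager());
        client.getHostConfiguration().setHost("jakarta.apache.org", 80, "http");

        String[] uris = { "/", "/commons/", "/commons/httpclient/" };
        for (int i = 0; i < uris.length; i++) {
          final String uri = uris[i];
          new Thread(new Runnable() {
            public void run() {
              GetMethod get = new GetMethod(uri);  // one method object per request
              try {
                int status = client.executeMethod(get);
                System.out.println(uri + " -> " + status);
              } catch (Exception e) {
                e.printStackTrace();
              } finally {
                get.releaseConnection();           // return connection to the pool
              }
            }
          }).start();
        }
      }
    }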

I am going to run new tests; any suggestions are highly welcome... 
It will take a few days (about 10 hours per test).


> Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> -----------------------------------------------------------------------
>
>          Key: NUTCH-109
>          URL: http://issues.apache.org/jira/browse/NUTCH-109
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7, 0.8-dev, 0.6, 0.7.1
>  Environment: Nutch: Windows XP, J2SE 1.4.2_09
> Web Server: Suse Linux, Apache HTTPD, apache2-worker,  v. 2.0.53
>     Reporter: Fuad Efendi
>  Attachments: protocol-httpclient-innovation-0.1.0.zip
>
> 1. A TCP connection costs a lot, not only for Nutch and the end-point web 
> servers, but also for intermediary network equipment. 
> 2. The web server creates a client thread and hopes that Nutch really uses 
> HTTP/1.1, or at least that Nutch sends "Connection: close" before calling 
> "Socket.close()" in the JVM...
> I need to perform very objective tests, probably 2-3 days; the new plugin 
> crawled/parsed 23,000 pages in 1,321 seconds; it seems that the existing 
> http plugin needs a few days...
> I am using a separate network segment with Windows XP (Nutch) and Suse Linux 
> (Apache HTTPD + 120,000 pages).
> Please find attached new plugin based on 
> http://www.innovation.ch/java/HTTPClient/
> Please note: 
> The HttpFactory class contains a cache of HTTPConnection objects; each object 
> may be used from any thread; each object is absolutely thread-safe, so we can 
> send multiple GET requests through a single instance:
>    private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 3);
> I'll add more comments after finishing tests...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
