[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331950 ]

Fuad Efendi commented on NUTCH-109:
-----------------------------------

Please see attachment for more details.

To make the comparison fair (protocol-http uses a single shared Socket per 
host), I tried changing this line in the new plugin's HttpFactory.java:
private static int CLIENTS_PER_HOST = 
NutchConf.get().getInt("http.clients.per.host", 1);

The default was 3 before. However, with http.clients.per.host=1 the new plugin 
stops in a deadlock. I tried a few times, and it always stops after 3-4 
minutes. So the results for the new plugin are with http.clients.per.host=3 
(as it was before); the new plugin did not pass the test with a single client 
per host, so treat its numbers as just a baseline.
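
The cause of the deadlock is not identified in this comment. As a rough 
illustration only of what a per-host client limit does in general, here is a 
minimal sketch. It assumes a modern JDK (java.util.concurrent is not 
available on J2SE 1.4.2); the class PerHostLimiter and its methods are 
hypothetical and are not part of the plugin or of Nutch. With a limit of 1, a 
second request for the same host has to wait until the first one releases its 
permit, so a permit that is never released stalls every later request for 
that host. Acquiring with a timeout turns such a stall into a visible failure.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not taken from the plugin: caps concurrent clients per host.
public class PerHostLimiter {
    private final int clientsPerHost;
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    public PerHostLimiter(int clientsPerHost) {
        this.clientsPerHost = clientsPerHost;
    }

    // Returns true if a client slot for the host was obtained within timeoutMs.
    public boolean acquire(String host, long timeoutMs) {
        Semaphore s = permits.computeIfAbsent(host, h -> new Semaphore(clientsPerHost));
        try {
            return s.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    // Gives the slot back; must be called once per successful acquire.
    public void release(String host) {
        Semaphore s = permits.get(host);
        if (s != null) {
            s.release();
        }
    }

    public static void main(String[] args) {
        PerHostLimiter limiter = new PerHostLimiter(1);
        boolean first = limiter.acquire("example.org", 100);  // slot is free
        boolean second = limiter.acquire("example.org", 100); // slot is taken, times out
        limiter.release("example.org");
        boolean third = limiter.acquire("example.org", 100);  // slot is free again
        System.out.println(first + " " + second + " " + third); // prints "true false true"
    }
}
```
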



New Test Results:
===============

1. PROTOCOL-HTTP 
=================
910,549,682 bytes (size on disk, WebDB+Segments)
1,201,908 milliseconds

2. PROTOCOL-HTTPCLIENT
========================
935,856,675 bytes
1,261,064 milliseconds

999. PROTOCOL-HTTPCLIENT-INNOVATION
==================================
936,152,532 bytes
1,305,377 milliseconds
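
To compare the three runs on one scale, the figures above can be turned into 
a rough rate. This is only a sketch: the byte counts are sizes on disk 
(WebDB+Segments), not bytes transferred, so the rate is an approximation, and 
the class name FetchThroughput is made up for this example.

```java
// Derives approximate rates from the sizes and times listed in the test results.
public class FetchThroughput {
    // Bytes stored per millisecond of crawl time.
    public static double bytesPerMs(long bytes, long millis) {
        return (double) bytes / millis;
    }

    // Run time relative to a baseline run (1.0 means equal time).
    public static double relativeTime(long millis, long baselineMillis) {
        return (double) millis / baselineMillis;
    }

    public static void main(String[] args) {
        String[] names = {
            "protocol-http", "protocol-httpclient", "protocol-httpclient-innovation"
        };
        long[] bytes = {910_549_682L, 935_856_675L, 936_152_532L};
        long[] millis = {1_201_908L, 1_261_064L, 1_305_377L};

        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-32s %7.1f bytes/ms, %.3fx the protocol-http time%n",
                    names[i],
                    bytesPerMs(bytes[i], millis[i]),
                    relativeTime(millis[i], millis[0]));
        }
    }
}
```

By this measure protocol-httpclient took about 4.9% longer than protocol-http, 
and protocol-httpclient-innovation about 8.6% longer, while storing slightly 
more data.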




nutch-site.xml
==============
<property>
  <name>fetcher.server.delay</name>
  <value>0</value>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>20</value>
</property>

<property>
  <name>http.timeout</name>
  <value>30000</value>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>



Client:
=======
IBM ThinkPad T42p, 2 GHz, 2 GB, Windows XP, J2SE 1.4.2_09


Server:
=======
Suse Linux 9.3, Apache HTTPD 2.0.53-9.5, Worker


Command:
========
bin\nutch7 crawl url3.txt -dir crawl005 -threads 20 -depth 6

(Modified crawl without indexing)


> Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> -----------------------------------------------------------------------
>
>          Key: NUTCH-109
>          URL: http://issues.apache.org/jira/browse/NUTCH-109
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7, 0.8-dev, 0.6, 0.7.1
>  Environment: Nutch: Windows XP, J2SE 1.4.2_09
> Web Server: Suse Linux, Apache HTTPD, apache2-worker,  v. 2.0.53
>     Reporter: Fuad Efendi
>  Attachments: protocol-httpclient-innovation-0.1.0.zip, test_results.txt
>
> 1. TCP connection costs a lot, not only for Nutch and end-point web servers, 
> but also for intermediary network equipment 
> 2. Web Server creates Client thread and hopes that Nutch really uses 
> HTTP/1.1, or at least Nutch sends "Connection: close" before closing in JVM 
> "Socket.close()" ...
> I need to perform very objective tests, probably 2-3 days; the new plugin 
> crawled/parsed 23,000 pages in 1,321 seconds; it seems that the existing 
> http-plugin needs a few days...
> I am using separate network segment with Windows XP (Nutch), and Suse Linux 
> (Apache HTTPD + 120,000 pages)
> Please find attached new plugin based on 
> http://www.innovation.ch/java/HTTPClient/
> Please note: 
> Class HttpFactory contains a cache of HTTPConnection objects; each object 
> runs in each thread; each object is absolutely thread-safe, so we can send 
> multiple GET requests using a single instance:
>    private static int CLIENTS_PER_HOST = 
> NutchConf.get().getInt("http.clients.per.host", 3);
> I'll add more comments after finishing tests...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
