[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331950 ]
Fuad Efendi commented on NUTCH-109:
-----------------------------------

Please see the attachment for more details.

In order to be fair (protocol-http uses a single shared Socket per host), I tried modifying this line in the new plugin's HttpFactory.java:

    private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 1);

It was 3 before. However, with http.clients.per.host=1 the new plugin stops in a deadlock. I tried a few times; it always stops after 3-4 minutes.

So the results below are with http.clients.per.host=3 for the new plugin (as it was before), but the new plugin didn't pass the test; this is just a baseline.

New Test Results:
=================

1. PROTOCOL-HTTP
================
910,549,682 bytes (size on disk, WebDB+Segments)
1,201,908 milliseconds

2. PROTOCOL-HTTPCLIENT
======================
935,856,675 bytes
1,261,064 milliseconds

999. PROTOCOL-HTTPCLIENT-INNOVATION
===================================
936,152,532 bytes
1,305,377 milliseconds

nutch-site.xml
==============
<property>
  <name>fetcher.server.delay</name>
  <value>0</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>20</value>
</property>
<property>
  <name>http.timeout</name>
  <value>30000</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>

Client:
=======
IBM ThinkPad T42p, 2 GHz, 2 GB, Windows XP, J2SE 1.4.2_09

Server:
=======
Suse Linux 9.3, Apache HTTPD 2.0.53-9.5, Worker

Command:
========
bin\nutch7 crawl url3.txt -dir crawl005 -threads 20 -depth 6
(Modified crawl without indexing)

> Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> -----------------------------------------------------------------------
>
>          Key: NUTCH-109
>          URL: http://issues.apache.org/jira/browse/NUTCH-109
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7, 0.8-dev, 0.6, 0.7.1
>  Environment: Nutch: Windows XP, J2SE 1.4.2_09
> Web Server: Suse Linux, Apache HTTPD, apache2-worker, v. 2.0.53
>     Reporter: Fuad Efendi
>  Attachments: protocol-httpclient-innovation-0.1.0.zip, test_results.txt
>
> 1. TCP connection costs a lot, not only for Nutch and end-point web servers,
> but also for intermediary network equipment
> 2. Web Server creates Client thread and hopes that Nutch really uses
> HTTP/1.1, or at least Nutch sends "Connection: close" before closing in JVM
> "Socket.close()" ...
> I need to perform very objective tests, probably 2-3 days; new plugin
> crawled/parsed 23,000 pages for 1,321 seconds; it seems that existing
> http-plugin needs few days...
> I am using separate network segment with Windows XP (Nutch), and Suse Linux
> (Apache HTTPD + 120,000 pages)
> Please find attached new plugin based on
> http://www.innovation.ch/java/HTTPClient/
> Please note:
> Class HttpFactory contains cache of HTTPConnection objects; each object run
> each thread; each object is absolutely thread-safe, so we can send multiple
> GET requests using single instance:
> private static int CLIENTS_PER_HOST =
> NutchConf.get().getInt("http.clients.per.host", 3);
> I'll add more comments after finishing tests...
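
For readers without the attachment, the per-host cache that the issue text describes (a bounded number of shared, thread-safe connection objects per host, controlled by http.clients.per.host) can be sketched roughly as below. This is only an illustration under stated assumptions: HostConnectionCache, Connection, and the round-robin reuse policy are hypothetical names and choices, not the HttpFactory code from the attached plugin, and Connection is a stand-in for a real keep-alive HTTP connection.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; NOT the HttpFactory class from the attached plugin. */
class HostConnectionCache {

    /** Stand-in for a thread-safe, keep-alive HTTP connection (assumption). */
    static final class Connection {
        final String host;
        final int slot;

        Connection(String host, int slot) {
            this.host = host;
            this.slot = slot;
        }
    }

    private final int clientsPerHost;
    private final Map<String, List<Connection>> byHost = new HashMap<String, List<Connection>>();
    private final Map<String, Integer> nextSlot = new HashMap<String, Integer>();

    HostConnectionCache(int clientsPerHost) {
        if (clientsPerHost < 1) {
            throw new IllegalArgumentException("clientsPerHost must be >= 1");
        }
        this.clientsPerHost = clientsPerHost;
    }

    /**
     * Returns a shared connection for the host. New objects are created until
     * the per-host limit is reached; after that, existing ones are handed out
     * round-robin. Callers never block waiting for a free connection.
     */
    synchronized Connection get(String host) {
        List<Connection> pool = byHost.get(host);
        if (pool == null) {
            pool = new ArrayList<Connection>();
            byHost.put(host, pool);
        }
        if (pool.size() < clientsPerHost) {
            Connection created = new Connection(host, pool.size());
            pool.add(created);
            return created;
        }
        Integer previous = nextSlot.get(host);
        int slot = (previous == null) ? 0 : previous.intValue();
        nextSlot.put(host, Integer.valueOf((slot + 1) % clientsPerHost));
        return pool.get(slot);
    }

    /** Number of connection objects currently cached for the host. */
    synchronized int size(String host) {
        List<Connection> pool = byHost.get(host);
        return (pool == null) ? 0 : pool.size();
    }

    public static void main(String[] args) {
        HostConnectionCache cache = new HostConnectionCache(3);
        for (int i = 0; i < 5; i++) {
            Connection c = cache.get("example.org");
            System.out.println("request " + i + " uses slot " + c.slot);
        }
    }
}
```

With a limit of 3, twenty fetcher threads hitting one host would share three connection objects; with a limit of 1, they would all share a single one, which is the configuration closest to protocol-http's single shared socket. The sketch says nothing about why the real plugin stalls at that setting.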