[ http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12332083 ]
Fuad Efendi commented on NUTCH-109:
-----------------------------------

Sorry for the typo in my previous post:
Apache HTTPD server, 1 GB RAM, single CPU, Worker model... it uses multiple processes and multiple threads, about 1.2 MB of memory per thread.

Default setting for KeepAliveTimeout on the server: 15 seconds
http://httpd.apache.org/docs/2.0/mod/core.html#keepalivetimeout

A connection is reused ("Keep-Alive") only when we send the next request within this 15-second interval. Nutch is currently polite: it waits 5 seconds by default between requests to the same host and randomizes the fetch list.

I was wrong: my previous "proposal" improves performance only for limited crawls (a single web server, etc.), and it makes no sense for whole-web crawls. I created this issue because I noticed some performance-related questions on the mailing lists (I also sent such questions in August-September).

Test result: performance is good.

I had one post related to "we are killing web servers": we send an HTTP request, the server creates a client thread, and then we send another HTTP request over another TCP socket. I was wrong again. We are using a shared TCP connection per host, and the server does not create 5 client threads for 5 HTTP requests; it uses a single thread whenever possible.

> Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> -----------------------------------------------------------------------
>
>          Key: NUTCH-109
>          URL: http://issues.apache.org/jira/browse/NUTCH-109
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7, 0.6, 0.7.1, 0.8-dev
>  Environment: Nutch: Windows XP, J2SE 1.4.2_09
> Web Server: Suse Linux, Apache HTTPD, apache2-worker, v. 2.0.53
>     Reporter: Fuad Efendi
>  Attachments: protocol-httpclient-innovation-0.1.0.zip, test_results.txt
>
> 1. A TCP connection costs a lot, not only for Nutch and end-point web servers,
> but also for intermediary network equipment
> 2. The web server creates a client thread and hopes that Nutch really uses
> HTTP/1.1, or at least that Nutch sends "Connection: close" before closing in JVM
> "Socket.close()" ...
> I need to perform very objective tests, probably 2-3 days; the new plugin
> crawled/parsed 23,000 pages in 1,321 seconds; it seems that the existing
> http-plugin needs a few days...
> I am using a separate network segment with Windows XP (Nutch) and Suse Linux
> (Apache HTTPD + 120,000 pages)
> Please find attached the new plugin based on
> http://www.innovation.ch/java/HTTPClient/
> Please note:
> Class HttpFactory contains a cache of HTTPConnection objects; each object is
> shared across threads and is fully thread-safe, so we can send multiple
> GET requests through a single instance:
> private static int CLIENTS_PER_HOST =
> NutchConf.get().getInt("http.clients.per.host", 3);
> I'll add more comments after finishing tests...
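
The per-host connection cache mentioned in the quoted description could look roughly like the sketch below. This is not the code from the attached plugin: PooledConnection and HostConnectionCache are hypothetical names used only for illustration, the round-robin selection is an assumption, and the code is written for a current JDK rather than J2SE 1.4.2. Only the limit of 3 clients per host comes from the issue (the default of "http.clients.per.host").

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a reusable, thread-safe connection to one host. */
final class PooledConnection {
    final String host;
    final int slot;

    PooledConnection(String host, int slot) {
        this.host = host;
        this.slot = slot;
    }
}

/** Hypothetical per-host cache: at most clientsPerHost connections per host. */
final class HostConnectionCache {
    private final int clientsPerHost;
    private final Map<String, List<PooledConnection>> byHost = new HashMap<>();
    private final Map<String, Integer> nextSlot = new HashMap<>();

    HostConnectionCache(int clientsPerHost) {
        if (clientsPerHost < 1) {
            throw new IllegalArgumentException("clientsPerHost must be >= 1");
        }
        this.clientsPerHost = clientsPerHost;
    }

    /** Returns a connection for the host, creating one only while below the limit. */
    synchronized PooledConnection get(String host) {
        // Host names are case-insensitive, so normalize the cache key.
        String key = host.toLowerCase(Locale.ROOT);
        List<PooledConnection> pool = byHost.computeIfAbsent(key, k -> new ArrayList<>());
        if (pool.size() < clientsPerHost) {
            PooledConnection created = new PooledConnection(key, pool.size());
            pool.add(created);
            return created;
        }
        // Pool is full: hand out existing connections in round-robin order.
        int index = nextSlot.getOrDefault(key, 0);
        nextSlot.put(key, (index + 1) % clientsPerHost);
        return pool.get(index);
    }

    /** Number of connections currently cached for the host. */
    synchronized int size(String host) {
        List<PooledConnection> pool = byHost.get(host.toLowerCase(Locale.ROOT));
        return pool == null ? 0 : pool.size();
    }
}
```

With a limit of 3, the first three calls to get("example.org") create three connections and the fourth call returns the first one again, so five requests to one host never open more than three TCP connections.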