Fuad,

Several days for 120,000 pages? That's very slow. Could you show some status lines from the log file (grep "status:")? What bandwidth do you have?
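As a rough sanity check on the timings quoted in this thread, here is a small Java sketch that converts the reported milliseconds into pages per second and extrapolates to 120,000 pages. It assumes throughput stays constant as the crawl grows, which a real crawl may not satisfy. The class and method names are made up for illustration and are not part of Nutch.

```java
// Sketch only: FetchThroughput and its methods are hypothetical helpers, not Nutch code.
class FetchThroughput {

    /** Pages per second for a run that fetched `pages` pages in `millis` milliseconds. */
    static double pagesPerSecond(long pages, long millis) {
        return pages / (millis / 1000.0);
    }

    /** Hours needed for `totalPages` pages, assuming the rate stays constant. */
    static double projectedHours(long totalPages, double pagesPerSecond) {
        return totalPages / pagesPerSecond / 3600.0;
    }

    public static void main(String[] args) {
        long measuredPages = 21_000;  // pages in the measured crawl, as reported
        long targetPages = 120_000;   // size of the full mirrored site, as reported

        // Plugin names and timings are copied from the report.
        String[] plugins = {"Protocol-HTTPClient-Innovation", "Protocol-HTTP", "Protocol-HttpClient"};
        long[] millis = {1_321_470L, 26_946_076L, 27_062_854L};

        for (int i = 0; i < plugins.length; i++) {
            double rate = pagesPerSecond(measuredPages, millis[i]);
            System.out.printf("%-32s %6.2f pages/s, ~%.1f h for %d pages%n",
                    plugins[i], rate, projectedHours(targetPages, rate), targetPages);
        }
    }
}
```

On these figures Protocol-HTTP works out to a bit under one page per second, or roughly 43 hours for the full set, against roughly 16 pages per second and about 2 hours for the new plugin. The issue description gives 23,000 pages rather than 21,000, so treat the result as an order-of-magnitude estimate.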
-AJ

On 10/11/05, Fuad Efendi (JIRA) <[EMAIL PROTECTED]> wrote:
>
> [ http://issues.apache.org/jira/browse/NUTCH-109?page=all ]
>
> Fuad Efendi updated NUTCH-109:
> ------------------------------
>
> Summary: Nutch - Fetcher - Performance Test - new
> Protocol-HTTPClient-Innovation (was: Nutch - Fetcher - HTTP - Performance
> Testing & Tuning)
>
> I performed performance tests, using default Apache HTTPD Web-Server
> installation, with crawled 120,000 pages (I used Teleport Ultra to crawl
> HTML pages from www.apache.org <http://www.apache.org>, I spent probably
> 10 hours)
>
> Everything run in a separate LAN, Windows XP (Client with Nutch 0.7.1),
> and Suse Linux 9.3 (Server with Apache)
>
> I measured crawl for 21,000 pages (Depth=6, Threads=20) (it seems to take
> few days to crawl all 120,000 pages):
>
> Protocol-HTTPClient-Innovation:
> 1,321,470 milliseconds
>
> Protocol-HTTP:
> 26,946,076 milliseconds
>
> Protocol-HttpClient:
> 27,062,854 milliseconds
>
>
> P.S.
> Please note, Protocol-HTTPClient-Innovation plugin is only basic version,
> v.0.1.0,
> HttpFactory is growing and contains cache (3 TCP connections per Host)
> http://www.innovation.ch/java/HTTPClient/ is very old but _production_
> level... style of a source code may seem too old... you may need to change
> "enum" to "enumeration" in downloaded source files in order to compile it
> :)))
>
> Very popular load-generating tool is based on HTTPClient (Innovation):
> http://grinder.sourceforge.net/
> http://www.innovation.ch/java/HTTPClient/
>
>
> > Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> > -----------------------------------------------------------------------
> >
> > Key: NUTCH-109
> > URL: http://issues.apache.org/jira/browse/NUTCH-109
> > Project: Nutch
> > Type: Improvement
> > Components: fetcher
> > Versions: 0.7, 0.8-dev, 0.6, 0.7.1
> > Environment: Nutch: Windows XP, J2SE 1.4.2_09
> > Web Server: Suse Linux, Apache HTTPD, apache2-worker, v. 2.0.53
> > Reporter: Fuad Efendi
> > Attachments: protocol-httpclient-innovation-0.1.0.zip
> >
> > 1. TCP connection costs a lot, not only for Nutch and end-point web servers, but also for intermediary network equipment
> > 2. Web Server creates Client thread and hopes that Nutch really uses HTTP/1.1, or at least Nutch sends "Connection: close" before closing in JVM "Socket.close()" ...
> > I need to perform very objective tests, probably 2-3 days; new plugin crawled/parsed 23,000 pages for 1,321 seconds; it seems that existing http-plugin needs few days...
> > I am using separate network segment with Windows XP (Nutch), and Suse Linux (Apache HTTPD + 120,000 pages)
> > Please find attached new plugin based on http://www.innovation.ch/java/HTTPClient/
> > Please note:
> > Class HttpFactory contains cache of HTTPConnection objects; each object run each thread; each object is absolutely thread-safe, so we can send multiple GET requests using single instance:
> > private static int CLIENTS_PER_HOST = NutchConf.get().getInt("http.clients.per.host", 3);
> > I'll add more comments after finishing tests...
>
> --
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the administrators:
> http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see:
> http://www.atlassian.com/software/jira
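
For readers who have not opened the attachment, the per-host cache described in the report might look roughly like the Java sketch below: at most N connection objects are created per host and then handed out round-robin, so repeated GETs to the same host can reuse an existing connection instead of opening a new one. This is an illustration under assumptions, not the plugin's HttpFactory: the class `HostConnectionCache`, its methods, and the factory function are hypothetical, and it only makes sense if the cached object is safe to share between threads, as the report says HTTPConnection is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch only: HostConnectionCache is a hypothetical stand-in, not the plugin's HttpFactory.
class HostConnectionCache<C> {

    /** Connections opened so far for one host, plus a round-robin cursor. */
    private static final class Slot<T> {
        final List<T> connections = new ArrayList<>();
        int next = 0;
    }

    private final int clientsPerHost;
    private final Function<String, C> connectionFactory;
    private final Map<String, Slot<C>> slots = new ConcurrentHashMap<>();

    HostConnectionCache(int clientsPerHost, Function<String, C> connectionFactory) {
        if (clientsPerHost < 1) {
            throw new IllegalArgumentException("clientsPerHost must be >= 1");
        }
        this.clientsPerHost = clientsPerHost;
        this.connectionFactory = connectionFactory;
    }

    /** Returns a connection for the host, creating one only while under the per-host limit. */
    C get(String host) {
        Slot<C> slot = slots.computeIfAbsent(host, h -> new Slot<C>());
        synchronized (slot) {
            if (slot.connections.size() < clientsPerHost) {
                C created = connectionFactory.apply(host);
                slot.connections.add(created);
                return created;
            }
            // Limit reached: reuse existing connections in round-robin order.
            C reused = slot.connections.get(slot.next);
            slot.next = (slot.next + 1) % clientsPerHost;
            return reused;
        }
    }

    public static void main(String[] args) {
        int[] opened = {0};
        // Strings stand in for real connection objects; 3 mirrors the reported default.
        HostConnectionCache<String> cache =
                new HostConnectionCache<>(3, host -> host + " #" + (++opened[0]));
        for (int i = 0; i < 5; i++) {
            System.out.println(cache.get("www.apache.org"));
        }
    }
}
```

The limit of 3 corresponds to the `http.clients.per.host` default quoted above. A real fetcher would also have to handle connections that the server has closed, which this sketch leaves out.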
