Some suggestions to improve performance:
1. Decrease the randomization of the FetchList.
Here is a comment from FetchListTool:
/**
* The TableSet class will allocate a given FetchListEntry
* into one of several ArrayFiles. It chooses which
* ArrayFile based on a hash of the URL's
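The allocation described in that comment can be sketched roughly as follows. This is only an illustration of hash-based partitioning, not Nutch's actual code; the class and method names here are made up:

```java
import java.security.MessageDigest;

public class FetchListPartitioner {
    // Pick an output partition for a URL by hashing it with MD5,
    // in the spirit of how FetchListTool spreads entries across
    // several ArrayFiles.
    public static int partitionFor(String url, int numPartitions) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(url.getBytes("UTF-8"));
            // Fold the first four digest bytes into a non-negative int.
            int h = ((digest[0] & 0xff) << 24) | ((digest[1] & 0xff) << 16)
                  | ((digest[2] & 0xff) << 8) | (digest[3] & 0xff);
            return (h & Integer.MAX_VALUE) % numPartitions;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The same URL always lands in the same partition, so all entries for one host can be kept together or spread apart depending on what is hashed.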
Does anybody know how I can unsubscribe from this mailing list?
Thanks,
Nima
Update on fetch performance of my current run: download speed has been
stable at 3.8 pages/sec, about 640 kbps. This is probably limited by my
bandwidth - regular DSL service, promising up to 1.5 mbps inbound but
realistically only 640 kbps.
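Those numbers are self-consistent: at 640 kbps and 3.8 pages/sec, the average page works out to roughly 21 KB. A quick check of the arithmetic:

```java
public class ThroughputCheck {
    public static void main(String[] args) {
        double kbps = 640.0;       // measured inbound bandwidth
        double pagesPerSec = 3.8;  // measured fetch rate
        double bytesPerSec = kbps * 1000 / 8;          // 80,000 bytes/sec
        double avgPageBytes = bytesPerSec / pagesPerSec;
        System.out.printf("~%.0f bytes/page (~%.1f KB)%n",
                avgPageBytes, avgPageBytes / 1024);
        // roughly 21,000 bytes per page, i.e. about 21 KB
    }
}
```

So the fetch rate is plausibly bandwidth-bound rather than CPU- or politeness-bound.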
More than 1 million pages were fetched, but it took
http://lucene.apache.org/nutch/mailing_lists.html
Kelvin's OC implementation queues fetch requests by host and uses the HTTP/1.1 protocol. It is currently a Nutch patch.
Michael Ji,
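I haven't seen the patch itself, but per-host queuing generally looks something like the sketch below: one FIFO queue per host, so a fetcher thread can reuse a single keep-alive connection per host and avoid hammering any one server. This is my own illustration, not the OC code:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import java.util.Queue;

public class HostQueues {
    // One FIFO queue of pending URLs per host.
    private final Map<String, Queue<String>> queues =
            new HashMap<String, Queue<String>>();

    public void add(String url) {
        String host;
        try {
            host = new URL(url).getHost();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("bad URL: " + url, e);
        }
        Queue<String> q = queues.get(host);
        if (q == null) {
            q = new LinkedList<String>();
            queues.put(host, q);
        }
        q.add(url);
    }

    // Next URL for a given host, or null when that host's queue is empty.
    public String next(String host) {
        Queue<String> q = queues.get(host);
        return (q == null) ? null : q.poll();
    }
}
```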
--- Fuad Efendi [EMAIL PROTECTED] wrote:
Correction to my previous post. I'd said:
When you use the FetchListTool to emit multiple lists, it
intentionally divides up the list using the MD5 value for the link,
so that you get hosts scattered between the lists. But for a single
list, this doesn't happen, and thus the max threads/host
Unfortunately, this is commented out in Kelvin's code:
// reqStr.append("Connection: Keep-Alive\r\n");
I found only
reqStr.append(" HTTP/1.1\r\n");
- but that alone does not mean HTTP/1.1 features are actually implemented.
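For reference, a raw HTTP/1.1 GET that explicitly asks for a persistent connection needs more than the version string. A hand-built sketch (not Kelvin's code; HTTP/1.1 connections are persistent by default, but the header makes the intent explicit and is what HTTP/1.0 servers expect):

```java
public class RequestBuilder {
    // Build a minimal HTTP/1.1 GET request string.
    public static String buildGet(String host, String path) {
        StringBuffer reqStr = new StringBuffer();
        reqStr.append("GET ").append(path).append(" HTTP/1.1\r\n");
        reqStr.append("Host: ").append(host).append("\r\n");   // required in 1.1
        reqStr.append("Connection: Keep-Alive\r\n");
        reqStr.append("\r\n");   // blank line ends the header section
        return reqStr.toString();
    }
}
```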
Teleport Ultra v.1.29 needs just a few hours to download all plain HTML
from SUN,
I never tried Kelvin's OC; I only browsed the source code a little.
We need to run tests with JVM 1.4, and with JVM 1.5 (Kelvin's OC).
If I am right, we are simply _killing_ many, many sites running a default
Apache HTTPD installation (Microsoft IIS, etc.) (150 keep-alive client
threads; I configured 6000
I did a clean, full svn update and ant on trunk, then tried
bin/nutch crawl urls -dir crawl.test
and got
051002 224950 SEVERE Unable to load parse plugins file from URL [parse-plugins.xml]
java.net.MalformedURLException: no protocol: ...
Likely a missing file:/ prefix. If I get rid of lines 617-622
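That exception is consistent with a bare path being handed to java.net.URL, which requires a scheme. A quick check, nothing Nutch-specific (the file path below is just an example):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlCheck {
    public static void main(String[] args) throws MalformedURLException {
        try {
            new URL("parse-plugins.xml");        // bare name, no scheme
        } catch (MalformedURLException e) {
            // prints "no protocol: parse-plugins.xml"
            System.out.println(e.getMessage());
        }
        // With a file: prefix the same name parses fine.
        URL ok = new URL("file:/conf/parse-plugins.xml");
        System.out.println(ok.getProtocol());    // prints "file"
    }
}
```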