Re: Fetcher2 Slow

Andrzej Bialecki Tue, 17 Mar 2009 04:59:34 -0700

Roger Dunk wrote:

Andrzej stated in NUTCH-669 that "some people reported performance issues with Fetcher2, i.e. that it doesn't use the available bandwidth. These reports are unconfirmed, and they may have been caused by suboptimal URL / host distribution in a fetchlist - but it would be good to review the synchronization and threading aspects of Fetcher2."
To address this, I've tried just now generating a fetchlist using generate.max.per.host = 1 (which gave me 35,000 unique hosts) to guarantee unique hosts, but the problem still remains.
Therefore, I believe it's clearly not an issue of suboptimal URL / host distribution. If you require any further information to confirm my report, you need only ask!
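
For anyone wanting to double-check the host distribution of a fetchlist independently, a minimal sketch follows. It assumes the fetchlist URLs have already been exported to plain text, one URL per line; how that export is produced is not covered here. The function names host_counts and summarize are hypothetical helpers, not part of Nutch.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_counts(urls):
    """Count URLs per hostname, skipping blank or unparseable lines."""
    counts = Counter()
    for line in urls:
        url = line.strip()
        if not url:
            continue
        try:
            host = urlsplit(url).hostname  # already lowercased by urlsplit
        except ValueError:
            continue  # malformed URL, e.g. a broken IPv6 literal
        if host:
            counts[host] += 1
    return counts


def summarize(counts):
    """Return total URLs, number of unique hosts, and the largest per-host count."""
    return {
        "urls": sum(counts.values()),
        "unique_hosts": len(counts),
        "max_per_host": max(counts.values(), default=0),
    }


# Illustrative sample only; a real run would pass an open file of URLs instead.
sample = [
    "http://a.example/page1",
    "http://a.example/page2",
    "http://b.example/",
]
print(summarize(host_counts(sample)))
```

With a fetchlist generated under generate.max.per.host = 1, max_per_host would be expected to come out as 1 and unique_hosts to equal the URL count, provided hosts are grouped by name in the same way this sketch groups them.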

Thanks for reporting this. Yes, we need more information - it's best if you create a JIRA issue, because then it will be easier to send attachments.


What we need at this moment is:

* the fetchlist - just zip the crawl_generate and attach it.
* nutch-site.xml and hadoop-site.xml (if you run in a distributed mode).
* cmd-line parameters, specifically the number of threads and -noParsing
* information about your environment (OS, cpu/mem, heapsize, JVM version).


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
