Roger Dunk wrote:
Andrzej stated in NUTCH-669 that "some people reported performance issues with Fetcher2, i.e. that it doesn't use the available bandwidth. These reports are unconfirmed, and they may have been caused by suboptimal URL / host distribution in a fetchlist - but it would be good to review the synchronization and threading aspects of Fetcher2."

To address this, I have just generated a fetchlist with generate.max.per.host = 1, so that each host contributes at most one URL (this gave me 35,000 unique hosts), but the problem still remains.

Therefore, I believe it's clearly not an issue of suboptimal URL / host distribution. If you require any further information to confirm my report, you need only ask!
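
One way to double-check the host distribution claimed above is to count URLs per host directly. The following is only a minimal sketch, not part of Nutch: it assumes the fetchlist URLs have already been exported to a plain list of strings, and the function names `host_distribution` and `summarize` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_distribution(urls):
    """Count how many URLs fall on each host (hostnames are lowercased)."""
    hosts = Counter()
    for url in urls:
        host = urlsplit(url.strip()).hostname
        if host:  # skip lines that do not parse to a host
            hosts[host] += 1
    return hosts


def summarize(urls):
    """Return the totals needed to judge how evenly URLs spread over hosts."""
    hosts = host_distribution(urls)
    return {
        "urls": sum(hosts.values()),
        "unique_hosts": len(hosts),
        "max_per_host": max(hosts.values(), default=0),
    }
```

With generate.max.per.host = 1, `max_per_host` should come out as 1 and `unique_hosts` should equal `urls`; anything else would suggest the fetchlist is less evenly distributed than expected.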

Thanks for reporting this. Yes, we need more information - it's best if you create a JIRA issue, because then it will be easier to send attachments.

What we need at this moment is:

* the fetchlist - just zip the crawl_generate and attach it.
* nutch-site.xml and hadoop-site.xml (if you run in distributed mode).
* command-line parameters, specifically the number of threads and whether you used -noParsing.
* information about your environment (OS, CPU/memory, heap size, JVM version).


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
