I finally got a three-machine cluster working with Nutch 1.3, Hadoop 0.20.0 and Cygwin! I have a few questions about configuration.
I am only going to be crawling a few domains and I need this cluster to be very fast. Right now it is slower using Hadoop in distributed mode than using just a local crawl. I am *guessing* that is due to the network overhead? It is very, very slow. What settings in mapred-site.xml and hdfs-site.xml might make my crawl faster? The crawldb update seems to take the longest.

I was digging around in the Hadoop documentation and the following seemed like good settings:

mapred.reduce.tasks = <2 x slave processors>
mapred.map.tasks = <10 x the number of slave processors>
increase mapred.child.java.opts memory

Anything else I am missing? What about running another crawl cycle immediately after the first generate is complete? Would that cause problems with concurrency and updating files/dbs?

--
View this message in context: http://lucene.472066.n3.nabble.com/Finally-got-hadoop-nutch-1-3-cygwin-cluster-working-now-tp3380170p3380170.html
Sent from the Nutch - User mailing list archive at Nabble.com.
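For reference, settings like those would go in conf/mapred-site.xml on each node. A minimal sketch, assuming two slave nodes with two cores each (four slave processors total) — the concrete values below are illustrative guesses from those multipliers, not something tested on this cluster:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- ~10 x slave processors (assumed: 2 slaves x 2 cores = 4) -->
  <property>
    <name>mapred.map.tasks</name>
    <value>40</value>
  </property>
  <!-- ~2 x slave processors -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>8</value>
  </property>
  <!-- raise child JVM heap; Hadoop 0.20's default is -Xmx200m,
       512m here is an arbitrary example -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
</configuration>
```

Note that mapred.map.tasks and mapred.reduce.tasks are only hints to the framework; the actual number of map tasks is mostly driven by the input splits.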

