> I finally got a three machine cluster working with Nutch 1.3, Hadoop
> 0.20.0 and Cygwin! I have a few questions about configuration.

Glad to hear!

> I am only going to be crawling a few domains and I need this cluster to
> be very fast. Right now it is slower using Hadoop in distributed mode
> than using just the local crawl. I am *guessing* that is due to the
> network overhead? It is very, very slow.

You need to know which component is slow: parse? fetch? update? etc. Keep
in mind that HDFS replicates blocks, which takes significant I/O.

> What settings in mapred-site.xml and hdfs-site.xml might make my crawl
> faster?

Impossible to tell.

> Seems like the crawldb update takes the longest.

Perhaps you filter and normalize in that step? You might not need to, as
parsing already does it.

> I was digging around in the Hadoop documentation and the following
> seemed like good settings:
>
> mapred.reduce.tasks = <2 x slave processors>
> mapred.map.tasks = <10 x the number of slave processors>

Good defaults on most systems.

> increase mapred.child.opts memory

Only if you run out of memory.

> Anything else I am missing? What about running another crawl cycle
> immediately after the first generate is complete? Would that cause
> problems with concurrency and updating files/dbs?

Yes, although there's an option to solve that. Better to generate many
segments in one go.

> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Finally-got-hadoop-nutch-1-3-cygwin-cluster-working-now-tp3380170p3380170.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
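On the crawldb update point: as far as I recall, Nutch 1.x's updatedb only
filters and normalizes when you pass the -filter / -normalize flags (or set
the corresponding properties), so skipping them avoids re-running the URL
filter and normalizer chains over every record. A sketch, with hypothetical
paths:

```shell
# Hypothetical crawldb and segment paths; adjust to your layout.
CRAWLDB=crawl/crawldb
SEGMENT=crawl/segments/20111001123456

# Omitting -filter and -normalize skips those passes during the update;
# parsing has normally filtered and normalized the URLs already.
bin/nutch updatedb $CRAWLDB $SEGMENT
```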
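The rules of thumb quoted above would translate into a mapred-site.xml
fragment like the following. The numbers assume a hypothetical two slaves
with 4 cores each (8 slave cores total), and the actual property name for
the child heap is mapred.child.java.opts:

```xml
<!-- mapred-site.xml: example values for two 4-core slaves (8 cores). -->
<configuration>
  <!-- ~2 x slave cores -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>16</value>
  </property>
  <!-- ~10 x slave cores; this is a hint to the framework, not a hard limit -->
  <property>
    <name>mapred.map.tasks</name>
    <value>80</value>
  </property>
  <!-- raise only if tasks actually run out of memory -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
</configuration>
```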
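On generating many segments in one go: since Nutch 1.1 the Generator can,
I believe, emit several segments from a single pass over the crawldb via
-maxNumSegments (and generate.update.crawldb marks selected URLs so a later
generate won't pick them again). A sketch with hypothetical paths and sizes:

```shell
# Generate up to 5 segments in one pass over the crawldb,
# then fetch, parse and update them in sequence.
bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -maxNumSegments 5

for seg in crawl/segments/*; do
  bin/nutch fetch $seg
  bin/nutch parse $seg
  bin/nutch updatedb crawl/crawldb $seg
done
```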

