Tim, I looked there, but it is a setup manual. I read the MapReduce, Sawzall, and the MS papers on these, but I need "best practices."
Thank you,
Mark

On Fri, Jan 23, 2009 at 3:22 PM, tim robertson <timrobertson...@gmail.com> wrote:

> Hi,
>
> Sounds like you might want to look at the Nutch project architecture
> and then see the Nutch on Hadoop tutorial -
> http://wiki.apache.org/nutch/NutchHadoopTutorial
> It does web crawling, and indexing using Lucene. It would be a good
> place to start anyway for ideas, even if it doesn't end up meeting
> your exact needs.
>
> Cheers,
>
> Tim
>
>
> On Fri, Jan 23, 2009 at 10:11 PM, Mark Kerzner <markkerz...@gmail.com> wrote:
> > Hi, esteemed group,
> >
> > How would I form Maps in MapReduce to recursively look at every file in a
> > directory and do something to this file, such as produce a PDF or compute
> > its hash?
> >
> > For that matter, Google builds its index using MapReduce, or so the papers
> > say. First the crawlers store all the files. Are there papers on the
> > architecture of this? How and where do the crawlers store the downloaded
> > files?
> >
> > Thank you,
> > Mark
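For the file-hashing part of the question quoted above, one common pattern (not from this thread) is a map-only job whose input is a plain text file listing one HDFS path per line; each map call opens that path and streams it through a digest. The sketch below assumes that listing layout and the org.apache.hadoop.mapreduce API; the class and method names (FileHashJob, HashMapper) are illustrative only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FileHashJob {

  /** Each input line is an HDFS path; emit (path, md5-hex-digest). */
  public static class HashMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      Path file = new Path(value.toString().trim());
      FileSystem fs = file.getFileSystem(context.getConfiguration());

      try {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[64 * 1024];
        InputStream in = fs.open(file);
        try {
          int n;
          while ((n = in.read(buf)) > 0) {
            md5.update(buf, 0, n);
          }
        } finally {
          in.close();
        }
        // Render the digest as lowercase hex.
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
          hex.append(String.format("%02x", b));
        }
        context.write(new Text(file.toString()), new Text(hex.toString()));
      } catch (NoSuchAlgorithmException e) {
        throw new IOException(e);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // args[0]: file(s) listing one path per line; args[1]: output directory
    Job job = Job.getInstance(new Configuration(), "file-hash");
    job.setJarByClass(FileHashJob.class);
    job.setMapperClass(HashMapper.class);
    job.setNumReduceTasks(0); // map-only: no aggregation needed
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The listing file itself would be produced beforehand (for example by walking the directory tree with `hadoop fs -ls -R` or FileSystem.listStatus); building the list outside the job keeps the map tasks simple and lets the framework split the listing across many mappers.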