Yitao Duan wrote:
I see Nutch is moving towards using MapReduce for many things, and there is already a branch that uses MapReduce for parsing and updatedb. I was wondering whether there are any benchmarks or tests validating the benefit of using the Nutch implementation of MapReduce, especially at large scale in a distributed setting. What tests have been done, and at what scale, for this MapReduce branch? I am doing my own testing, but it would be good to know what others have experienced.
I have not yet run many benchmarks or large-scale experiments. Once a critical mass of Nutch code has been ported to MapReduce, I will begin performance tuning and scale testing. Over the course of the summer I hope to demonstrate that Nutch can scale to billions of pages using tens of machines.
Doug
