On Thu, May 22, 2014 at 04:47:28PM -0400, Marcos Ortiz wrote:
> On Thursday, May 22, 2014 10:17:42 PM Sylvain Gault wrote:
> > Hello,
> >
> > I'm new to this mailing list, so forgive me if I don't do everything
> > right.
> >
> > I didn't know whether I should ask on this mailing list or on
> > mapreduce-dev or on yarn-dev. So I'll just start there. ^^
> >
> > Short story: I'm looking for some paper(s) studying the scalability
> > of Hadoop MapReduce. And I found this extremely difficult to find on
> > google scholar. Do you have something worth citing in a PhD thesis?
> >
> > Long story: I'm writing my PhD thesis about MapReduce and when I talk
> > about Hadoop I'd like to say "how much it scales". I heard two years
> > ago some people say that "Yahoo! got it to scale up to 4000 nodes and
> > plan to try on 6000 nodes" or something like that. I also heard that
> > YARN/MRv2 should scale better, but I don't plan to talk much about
> > YARN/MRv2. So I'd take anything I could cite as a reference in my
> > manuscript. :)
>
> Hello, Sylvain.
>
> One of the reasons why the Hadoop dev team began to work on YARN is precisely
> looking for a more scalable and resourceful Hadoop system, so if you actually
> want to talk about Hadoop scalability, you should talk about YARN and MR2.
>
> >
> > The paper is here:
> > https://developer.yahoo.com/blogs/hadoop/next-generation-apache-hadoop-mapreduce-3061.html
>
This was a very interesting read. Maybe not very academic, but if that's all
we have, I'll take it. I also found these:

https://developer.yahoo.com/blogs/hadoop/scaling-hadoop-4000-nodes-yahoo-410.html
https://developer.yahoo.com/blogs/hadoop/hadoop-sorts-petabyte-16-25-hours-terabyte-62-422.html

Somehow I was expecting that someone had done a real scalability study
comparing MRv2 and MRv1: comparing the total time of several benchmarks for
1000, 2000, ... 6000 nodes, and plotting some curves. :) But that's just how
I would have done it. :)

> You should talk with Arun C Murthy, Chief Architect at Hortonworks about all
> these topics. He could help you much more than I could.

I'm convinced it would be very, very interesting. But I do not have much time
to spend on understanding Hadoop and I still have several chapters to write. :)
I have almost everything I needed to know about Hadoop. But when I'm done, I
may also ask people here to proofread what I wrote about it. :)

Sylvain