Re: MapReduce scalability study
On Thu, May 22, 2014 at 04:47:28PM -0400, Marcos Ortiz wrote:
> On Thursday, May 22, 2014 10:17:42 PM Sylvain Gault wrote:
> > Hello,
> >
> > I'm new to this mailing list, so forgive me if I don't do everything
> > right.
> >
> > I didn't know whether I should ask on this mailing list or on
> > mapreduce-dev or on yarn-dev. So I'll just start there. ^^
> >
> > Short story: I'm looking for some paper(s) studying the scalability
> > of Hadoop MapReduce. And I found this extremely difficult to find on
> > Google Scholar. Do you have something worth citing in a PhD thesis?
> >
> > Long story: I'm writing my PhD thesis about MapReduce and when I talk
> > about Hadoop I'd like to say "how much it scales". I heard two years
> > ago some people say that "Yahoo! got it to scale up to 4000 nodes and plan
> > to try on 6000 nodes" or something like that. I also heard that
> > YARN/MRv2 should scale better, but I don't plan to talk much about
> > YARN/MRv2. So I'd take anything I could cite as a reference in my
> > manuscript. :)
>
> Hello, Sylvain.
>
> One of the reasons why the Hadoop dev team began to work on YARN is precisely
> to get a more scalable and resourceful Hadoop system, so if you actually
> want to talk about Hadoop scalability, you should talk about YARN and MR2.
>
> The paper is here:
> https://developer.yahoo.com/blogs/hadoop/next-generation-apache-hadoop-mapreduce-3061.html

This was a very interesting read. Maybe not very academic, but if that's
all we've got, I'll take it. I also found these:
https://developer.yahoo.com/blogs/hadoop/scaling-hadoop-4000-nodes-yahoo-410.html
https://developer.yahoo.com/blogs/hadoop/hadoop-sorts-petabyte-16-25-hours-terabyte-62-422.html

Somehow I was expecting that someone had done a real scalability study
comparing MRv2 and MRv1: comparing the total time of several benchmarks
for a number of nodes (1000, 2000, ... 6000) and plotting some curves. :)
But that's just how I would have done it. :)

> You should talk with Arun C Murthy, Chief Architect at Hortonworks about all
> these topics. He could help you much more than I could.

I'm convinced it would be very, very interesting. But I do not have much
time to spend on understanding Hadoop and I still have several chapters
to write. :) I almost have everything I needed to know about Hadoop. But
when I'm done, I may also ask people here to proofread what I wrote about
it. :)

Sylvain
Re: MapReduce scalability study
I only talk about Hadoop because it is the de facto implementation of
MapReduce. But for the remainder of my thesis, I took a more general
approach and implemented my algorithms in a custom MapReduce
implementation.

I learned yesterday about the existence of YARN. :D And I definitely
can't not talk about it since it's the future and 1.x will be abandoned.
But I mostly know about MRv1, so I decided to only briefly talk about
MRv2 where the differences are relevant, i.e. for scalability and global
architecture, I guess.

Sylvain

On Thu, May 22, 2014 at 05:39:43PM -0300, Marco Shaw wrote:
> I would consider the timeframe that you are looking for to determine if you
> should focus on Hadoop 2.x (with YARN) or older. 2.x should scale much better
> than 1.x.
>
> Keep in mind that 2.x was only "officially" released late last year.
>
> Marco
>
> > On May 22, 2014, at 5:17 PM, Sylvain Gault wrote:
> >
> > Hello,
> >
> > I'm new to this mailing list, so forgive me if I don't do everything
> > right.
> >
> > I didn't know whether I should ask on this mailing list or on
> > mapreduce-dev or on yarn-dev. So I'll just start there. ^^
> >
> > Short story: I'm looking for some paper(s) studying the scalability
> > of Hadoop MapReduce. And I found this extremely difficult to find on
> > Google Scholar. Do you have something worth citing in a PhD thesis?
> >
> > Long story: I'm writing my PhD thesis about MapReduce and when I talk
> > about Hadoop I'd like to say "how much it scales". I heard two years
> > ago some people say that "Yahoo! got it to scale up to 4000 nodes and plan
> > to try on 6000 nodes" or something like that. I also heard that
> > YARN/MRv2 should scale better, but I don't plan to talk much about
> > YARN/MRv2. So I'd take anything I could cite as a reference in my
> > manuscript. :)
> >
> > Best regards,
> > Sylvain Gault
Re: MapReduce scalability study
On Thursday, May 22, 2014 10:17:42 PM Sylvain Gault wrote:
> Hello,
>
> I'm new to this mailing list, so forgive me if I don't do everything
> right.
>
> I didn't know whether I should ask on this mailing list or on
> mapreduce-dev or on yarn-dev. So I'll just start there. ^^
>
> Short story: I'm looking for some paper(s) studying the scalability
> of Hadoop MapReduce. And I found this extremely difficult to find on
> Google Scholar. Do you have something worth citing in a PhD thesis?
>
> Long story: I'm writing my PhD thesis about MapReduce and when I talk
> about Hadoop I'd like to say "how much it scales". I heard two years
> ago some people say that "Yahoo! got it to scale up to 4000 nodes and plan
> to try on 6000 nodes" or something like that. I also heard that
> YARN/MRv2 should scale better, but I don't plan to talk much about
> YARN/MRv2. So I'd take anything I could cite as a reference in my
> manuscript. :)

Hello, Sylvain.

One of the reasons why the Hadoop dev team began to work on YARN is
precisely to get a more scalable and resourceful Hadoop system, so if you
actually want to talk about Hadoop scalability, you should talk about
YARN and MR2.

The paper is here:
https://developer.yahoo.com/blogs/hadoop/next-generation-apache-hadoop-mapreduce-3061.html

and the related JIRA issues here:
https://issues.apache.org/jira/browse/MAPREDUCE-278
https://issues.apache.org/jira/browse/MAPREDUCE-279

You should talk with Arun C Murthy, Chief Architect at Hortonworks, about
all these topics. He could help you much more than I could.

--
Marcos Ortiz[1] (@marcosluis2186[2])
http://about.me/marcosortiz[3]

> Best regards,
> Sylvain Gault

[1] http://www.linkedin.com/in/mlortiz
[2] http://twitter.com/marcosluis2186
[3] http://about.me/marcosortiz
Re: MapReduce scalability study
I would consider the timeframe that you are looking for to determine if
you should focus on Hadoop 2.x (with YARN) or older. 2.x should scale
much better than 1.x.

Keep in mind that 2.x was only "officially" released late last year.

Marco

> On May 22, 2014, at 5:17 PM, Sylvain Gault wrote:
>
> Hello,
>
> I'm new to this mailing list, so forgive me if I don't do everything
> right.
>
> I didn't know whether I should ask on this mailing list or on
> mapreduce-dev or on yarn-dev. So I'll just start there. ^^
>
> Short story: I'm looking for some paper(s) studying the scalability
> of Hadoop MapReduce. And I found this extremely difficult to find on
> Google Scholar. Do you have something worth citing in a PhD thesis?
>
> Long story: I'm writing my PhD thesis about MapReduce and when I talk
> about Hadoop I'd like to say "how much it scales". I heard two years
> ago some people say that "Yahoo! got it to scale up to 4000 nodes and plan
> to try on 6000 nodes" or something like that. I also heard that
> YARN/MRv2 should scale better, but I don't plan to talk much about
> YARN/MRv2. So I'd take anything I could cite as a reference in my
> manuscript. :)
>
> Best regards,
> Sylvain Gault