Hello,

I just wanted to let you know that during the last few months I was invited by several (machine learning/information retrieval/database) research groups here in Berlin to tell them more about Mahout and give a brief overview of Hadoop.
Usually I gave two example applications, explained the main motivation for Mahout, introduced Hadoop at a very high level, and showed some strategies for coming up with parallel solutions. After that I gave an overview of the existing implementations in Mahout and of why and how participation is possible.

My impression from those talks was that people are pretty interested in what is going on here. Some have set up their own Hadoop cluster and run experiments on it. Some are planning to do so in the near future. A few even expressed interest in contributing to the project.

There were a few common reactions/observations that I would like to share with you - comments, corrections, additions very welcome:

People seem to slowly become aware that there is something named Hadoop that implements a framework for parallel programming originally developed at Google. However, the basic assumptions and implications (e.g. data locality) are known only to a few groups/people, at least in the IR and data mining domains.

Whenever I asked people using Apache software whether they are subscribed to the corresponding user mailing list, the answer was a puzzled look and a no. I tried to make clear why participation is important - I guess we will see in the near future whether I was successful ;)

I was surprised to see people only vaguely aware of the GSoC program. They knew that it exists, but the general setup was not as widely known as I would have expected. After all, among our GSoC proposals there seemed to be quite a few students co-supervised by their university.

Concerning Mahout I got varying feedback: A few people who had a look at it last autumn found it difficult to find the source code and documentation. Some students had a look shortly after ApacheCon EU this year and found it hard to set up a demo application.
I think having some JavaDoc, tutorials and setup documentation for each release version on our website might help people get started more easily. Other than that, the general feedback seemed to be that we are doing "surprisingly well", both in terms of the emerging community and in terms of implementation progress over the first year.

Last but not least: From DIMA at TU Berlin I received the offer to do a "Mahout seminar". It would consist of two parts. The first is a theoretical one where students read scientific publications, prepare a survey and give a talk by the end of the semester. The other part would be a project where they could work, for instance, on some algorithm implementation or integrate already existing implementations into a project. The goal would be to strengthen their programming and project management skills and along the way have them contribute back to the community.

My first thought was to prepare a task with the goal of building a new blog "search engine". They could build a system that identifies clusters of blogs on a common topic, work on the link graph in the blogosphere, detect newly emerging topics and the like.

Before preparing the final seminar proposal, I would like to ask you whether there is anything you might want those students to work on during their winter term.

Sorry for the overly long e-mail...

Isabel

-- 
 |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
 /,`.-'`'    -.  ;-;;,_
|,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xmpp://[email protected]>
