Gilles,

These both sound like reasonable topics, but the second one is distinctly more interesting. In particular, it would be very interesting if you could use some of John Langford's ideas from his NIPS 2009 lecture where he shards on features. If you combine that with the ideas of random forests where subsets of features are chosen somewhat at random rather than being a true partition, then you might really have a strong project.
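To make the contrast concrete, here is a minimal sketch of the two ways of assigning features to shards: a true partition versus randomly drawn, possibly overlapping subsets. The function names, the round-robin partitioning, and the parameters are assumptions made for illustration; they are not taken from Mahout or from the lecture.

```python
import random


def partition_features(n_features, n_shards):
    """True partition: every feature index lands in exactly one shard.

    Round-robin assignment is an arbitrary choice for this sketch.
    """
    return [list(range(s, n_features, n_shards)) for s in range(n_shards)]


def random_feature_subsets(n_features, n_shards, subset_size, seed=0):
    """Random-forest-style alternative: each shard draws its own subset.

    Shards may overlap, and some features may not appear in any shard.
    """
    rng = random.Random(seed)
    return [
        sorted(rng.sample(range(n_features), subset_size))
        for _ in range(n_shards)
    ]


if __name__ == "__main__":
    print(partition_features(10, 3))
    print(random_feature_subsets(10, 3, subset_size=4))
```

With a partition, each shard sees a disjoint slice of the features; with random subsets, the same feature can be seen by several shards, which is closer in spirit to how random forests pick attributes.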
Regarding the reuse of the code you write, you would be completely free to publish or reuse or sell the work you contribute to Mahout. Not only that, but you can use all of the Mahout code that you didn't write and anybody else can use all of Mahout in any way that they like as well. Open really does mean open with Apache software.

On Wed, Mar 24, 2010 at 2:36 PM, Gilles Louppe <g.lou...@gmail.com> wrote:

> Dear Mahout developers,
>
> I am a Belgian student in Computer Science and I'd be very interested
> in working on Mahout this summer! I am actually a soon-to-be PhD
> student in large-scale machine learning. To be honest however, I never
> had the chance to use Mahout yet, but I am very eager to test it out
> and to contribute! I indeed think that this would be a very good start
> to dive into the subject of my PhD thesis.
>
> I am actually quite tempted by some of the GSoC projects you proposed,
> including MAHOUT-327 and MAHOUT-342.
>
> For MAHOUT-327, I propose to implement Extra-trees [1]. This method is
> a tree-based ensemble method for supervised classification and
> regression problems. It is actually quite close to Random Forests. The
> main difference is that cut-points are drawn at random when splitting
> a node, and then the best one is used to split the current node. (In
> Random Forests, the actual best cut-points are computed for each of
> the K random attributes, they are not drawn at random.) I know that a
> RF module is already integrated in Mahout and I believe that it could
> be a good complement in the algorithm toolbox. However, I wonder if
> that task would be large enough to constitute a GSoC project in
> itself, since the implementation could be quite close to the RF one.
>
> Anyhow, I am also very interested in implementing Neural Networks over
> Map/Reduce (Mahout-342). If I understand correctly [2], this is indeed
> still lacking in Mahout.
>
> [1]: http://www.montefiore.ulg.ac.be/~ernst/extremely-randomized-trees.pdf
> [2]: http://cwiki.apache.org/MAHOUT/algorithms.html
>
> From a more legal point of view, I was also wondering if I would be
> allowed to reuse what I'd have implemented in this GSoC for my own
> research? Since Mahout is open-source, I guess this is perfectly fine,
> but what about maybe writing a publication somehow related or things
> like that?
>
> Best regards,
>
> Gilles Louppe
> MSc student in Computer Sciences
> Université de Liège (Belgium)
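
The cut-point difference described in the quoted message can be sketched for a single node as follows. This is only an illustration under stated assumptions: it uses weighted Gini impurity as the split score for classification, draws random cut-points uniformly between the observed minimum and maximum of an attribute, and all function names are hypothetical. It is not the Mahout RF implementation, nor a complete Extra-trees implementation.

```python
import random


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_score(rows, labels, attr, cut):
    """Weighted Gini impurity after splitting on rows[i][attr] < cut.

    Lower is better.
    """
    left = [y for x, y in zip(rows, labels) if x[attr] < cut]
    right = [y for x, y in zip(rows, labels) if x[attr] >= cut]
    n = len(labels)
    return (len(left) * gini(left) + len(right) * gini(right)) / n


def extra_trees_split(rows, labels, k, rng):
    """Draw one random cut-point for each of k random attributes,
    then keep the best of those k candidates."""
    attrs = rng.sample(range(len(rows[0])), k)
    candidates = []
    for a in attrs:
        lo = min(x[a] for x in rows)
        hi = max(x[a] for x in rows)
        candidates.append((a, rng.uniform(lo, hi)))
    return min(candidates, key=lambda c: split_score(rows, labels, *c))


def random_forest_split(rows, labels, k, rng):
    """For each of k random attributes, search every midpoint between
    consecutive distinct values and keep the best cut-point overall."""
    attrs = rng.sample(range(len(rows[0])), k)
    candidates = []
    for a in attrs:
        values = sorted(set(x[a] for x in rows))
        for v1, v2 in zip(values, values[1:]):
            candidates.append((a, (v1 + v2) / 2.0))
    if not candidates:
        return None  # every selected attribute is constant at this node
    return min(candidates, key=lambda c: split_score(rows, labels, *c))


if __name__ == "__main__":
    rows = [[0.0, 5.0], [1.0, 5.5], [2.0, 4.0], [3.0, 6.0]]
    labels = [0, 0, 1, 1]
    rng = random.Random(0)
    print("extra-trees:", extra_trees_split(rows, labels, 2, rng))
    print("random forest:", random_forest_split(rows, labels, 2, rng))
```

The only difference between the two functions is how candidate cut-points are generated, which is why the two implementations could plausibly share most of their code.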