Dear Mahout developers,

I am a Belgian student in Computer Science and I would be very interested in working on Mahout this summer! I am about to start a PhD in large-scale machine learning. To be honest, I have not yet had the chance to use Mahout, but I am eager to try it out and to contribute. I think this would be a very good way to dive into the subject of my PhD thesis.

I am quite tempted by some of the GSoC projects you proposed, in particular MAHOUT-327 and MAHOUT-342.

For MAHOUT-327, I propose to implement Extra-Trees [1], a tree-based ensemble method for supervised classification and regression problems. It is quite close to Random Forests. The main difference is in how a node is split: for each of the K randomly selected attributes, a cut-point is drawn at random, and the best of these K candidate splits is used to split the current node. (In Random Forests, the actual best cut-point is computed for each of the K random attributes; cut-points are not drawn at random.) I know that an RF module is already integrated in Mahout, and I believe that Extra-Trees could be a good complement to it in the algorithm toolbox. However, I wonder whether that task would be large enough to constitute a GSoC project in itself, since the implementation could be quite close to the RF one.

In any case, I am also very interested in implementing Neural Networks over Map/Reduce (MAHOUT-342). If I understand correctly [2], this is still lacking in Mahout.

[1]: http://www.montefiore.ulg.ac.be/~ernst/extremely-randomized-trees.pdf
[2]: http://cwiki.apache.org/MAHOUT/algorithms.html

From a more legal point of view, I was also wondering whether I would be allowed to reuse what I implement during this GSoC for my own research. Since Mahout is open source, I guess this is perfectly fine, but what about writing a publication related to it, or things like that?

Best regards,

Gilles Louppe
MSc student in Computer Sciences
Université de Liège (Belgium)