Should be visible now on both the GSoC app and the Apache Wiki. --- On Wed, 1 Apr 2009, Ted Dunning <ted.dunn...@gmail.com> wrote:
> From: Ted Dunning <ted.dunn...@gmail.com>
> Subject: Re: [gsoc] random forests
> To: mahout-dev@lucene.apache.org
> Date: Wednesday, 1 April 2009, 3:30 AM
>
> Deneche,
>
> I don't see your application on the GSOC web site. Nor on the apache wiki.
>
> Time is running out and I would hate to not see you in the program. Is it
> just that I can't see the application yet?
>
> On Tue, Mar 31, 2009 at 1:05 PM, deneche abdelhakim <a_dene...@yahoo.fr> wrote:
>
> > Here is a draft of my proposal
> >
> > **************************************************
> > Title/Summary: [Apache Mahout] Implement parallel Random/Regression Forests
> >
> > Student: AbdelHakim Deneche
> > Student e-mail: ...
> >
> > Student Major: PhD in Computer Science
> > Student Degree: Master in Computer Science
> > Student Graduation: Spring 2011
> >
> > Organization: The Apache Software Foundation
> > Assigned Mentor:
> >
> >
> > Abstract:
> >
> > My goal is to add the power of random/regression forests to Mahout. At
> > the end of this summer, one should be able to build random/regression
> > forests for large, possibly distributed, datasets, store the forest, and
> > reuse it to classify new data. In addition, a demo on EC2 is planned.
> >
> >
> > Detailed Description:
> >
> > This project is all about random/regression forests. The core component
> > is the tree-building algorithm, which grows a decision tree from a random
> > bootstrap of the whole dataset. I already wrote a detailed description on
> > the Mahout Wiki [RandomForests]. Depending on the size of the dataset,
> > two distributed implementations are possible:
> >
> > 1. The most straightforward one deals with relatively small datasets. By
> > small, I mean a dataset that can be replicated on every node of the
> > cluster. Each mapper has access to the whole dataset, so if the forest
> > contains N trees and we have M mappers, each mapper runs the core
> > building algorithm N/M times. This implementation is relatively easy
> > because each mapper runs the basic building algorithm "as it is". It is
> > also of great interest if the user wants to "try" different parameters
> > when building the forest. An out-of-core implementation is also possible,
> > to deal with datasets that cannot fit into a node's memory.
> >
> > 2. The second implementation, which is the more difficult, is concerned
> > with very large datasets that cannot fit on every machine of the cluster.
> > In this case the mappers work differently: each mapper has access to a
> > subset of the dataset, so all the mappers collaborate to build each tree
> > of the forest. The core building algorithm must therefore be rewritten in
> > map-reduce form. This implementation can deal with datasets of any size,
> > as long as they fit on the cluster.
> >
> > Although the first implementation is easier to implement, the CPU and IO
> > overhead of its out-of-core variant is still unknown. A reference,
> > non-parallel implementation should thus be built to better understand the
> > effects of the out-of-core approach, especially for large datasets. This
> > reference implementation is also useful to assess the correctness of the
> > distributed implementation.
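A minimal Java sketch of the first scheme described above, assuming a dataset
replicated on every node: bootstrap sampling with replacement, and N trees
split over M mappers. TreeBuilder and the dataset handling are hypothetical
stand-ins, not Mahout classes; this illustrates the partitioning only.

  import java.util.Random;

  public class PartitionedForestSketch {

    // Bootstrap: draw numRows row indices with replacement from [0, numRows).
    static int[] bootstrapSample(int numRows, Random rng) {
      int[] sample = new int[numRows];
      for (int i = 0; i < sample.length; i++) {
        sample[i] = rng.nextInt(numRows);
      }
      return sample;
    }

    // Trees built by mapper `mapperIndex` (0-based) when n trees are split
    // over m mappers; the remainder of n/m goes to the first (n mod m) mappers.
    static int treesForMapper(int n, int m, int mapperIndex) {
      return n / m + (mapperIndex < n % m ? 1 : 0);
    }

    public static void main(String[] args) {
      int numTrees = 10;   // N
      int numMappers = 3;  // M
      int numRows = 1000;  // dataset size, replicated on every node
      Random rng = new Random(42);

      for (int mapper = 0; mapper < numMappers; mapper++) {
        int share = treesForMapper(numTrees, numMappers, mapper);
        System.out.println("mapper " + mapper + " builds " + share + " trees");
        for (int t = 0; t < share; t++) {
          int[] sample = bootstrapSample(numRows, rng);
          // Each tree is grown from its own bootstrap; rows that were never
          // drawn form the out-of-bag set used for the error estimation.
          // Tree tree = TreeBuilder.build(dataset, sample);  // hypothetical
        }
      }
    }
  }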
> >
> > Working Plan and list of deliverables:
> >
> > Must-Have:
> > 1. Reference implementation of the Random/Regression Forests building
> > algorithm:
> >   . Build a forest of trees: the basic algorithm (described in the wiki)
> >     takes a subset of the dataset as a training set and builds a decision
> >     tree. This algorithm is repeated for each tree of the forest.
> >   . The forest is stored in a file, so it can be re-used, at any time, to
> >     classify new cases (see the sketch after this list).
> >   . At this step, the necessary changes are made to Mahout's Classifier
> >     interface to extend its use beyond Text datasets.
> >
> > 2. Study the effects of large datasets on the reference implementation:
> >   . This step should guide our choice of the proper parallel
> >     implementation.
> >
> > 3. Parallel implementation, choose one of the following:
> > 3a. Parallel implementation A:
> >   . When the dataset can be replicated to all computing nodes.
> >   . Each mapper has access to the whole dataset; if the forest contains N
> >     trees and we have M mappers, each mapper runs the basic building
> >     algorithm N/M times. The mapper is also responsible for computing the
> >     out-of-bag error estimation.
> >   . The reducer stores the trees in the RF file and merges the oob error
> >     estimations.
> > 3b. Parallel implementation B:
> >   . When the dataset is so big that it can no longer fit on every
> >     computing node, it must be distributed over the cluster.
> >   . Each mapper has access to a subset of the dataset, so all the mappers
> >     collaborate to build each tree of the forest.
> >   . In this case, the basic algorithm must be rewritten to fit the
> >     map-reduce paradigm.
> >
> > Should-Have:
> > 4. Run the Random Forest with a real dataset on EC2:
> >   . This step is important, because running the RF on a local dual-core
> >     machine is different from running it on a real cluster with a real
> >     dataset.
> >   . This can make a good demo for Mahout.
> >   . Amazon has published some interesting datasets to play with
> >     [PublicDatasets]. The US Census dataset comes in various sizes
> >     ranging from 2 GB to 200 GB, and should make a very good example.
> >   . At this stage it may be useful to implement [MAHOUT-71] (Dataset to
> >     Matrix Reader).
> >
> > Wanna-Have:
> > 5. If there is still time, implement one or two other important features
> >    of RFs, such as variable importance and proximity estimation.
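A minimal Java sketch of the reference forest from point 1 above: classifying
a new case is a majority vote over the trees, and plain Java serialization
stands in for whatever RF file format the project actually adopts. The Tree
interface is a hypothetical placeholder for the real decision tree.

  import java.io.*;
  import java.util.*;

  public class ForestSketch implements Serializable {

    // Hypothetical decision tree; classify returns a label index.
    interface Tree extends Serializable {
      int classify(double[] instance);
    }

    private final List<Tree> trees = new ArrayList<>();

    void addTree(Tree tree) { trees.add(tree); }

    // Majority vote over all trees of the forest.
    int classify(double[] instance) {
      Map<Integer, Integer> votes = new HashMap<>();
      for (Tree tree : trees) {
        votes.merge(tree.classify(instance), 1, Integer::sum);
      }
      int best = -1, bestCount = -1;
      for (Map.Entry<Integer, Integer> e : votes.entrySet()) {
        if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
      }
      return best;
    }

    // Persist the forest so it can be reloaded later to classify new cases.
    void save(File file) throws IOException {
      try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
        out.writeObject(this);
      }
    }

    static ForestSketch load(File file) throws IOException, ClassNotFoundException {
      try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
        return (ForestSketch) in.readObject();
      }
    }
  }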
> >
> > Additional Information:
> >
> > I am a PhD student at the University Mentouri of Constantine. My primary
> > research goal is a framework to help build Intelligent Adaptive Systems.
> > For my Master's, I worked on Artificial Immune Systems, applying them to
> > handwritten digit recognition [PatternRecognition] and Multiple Sequence
> > Alignment (bioinformatics) [BioInformatics]. Last year I participated in
> > GSoC as an Apache student; I integrated an evolutionary computing
> > framework into Mahout [Mahout-56].
> >
> > References:
> >
> > [BioInformatics] A. Layeb, A. Deneche, "Multiple Sequence Alignment by
> > Immune Artificial System", ACS/IEEE International Conference on Computer
> > Systems and Applications AICCSA'07, Jordan, 2007.
> >
> > [Mahout-56] https://issues.apache.org/jira/browse/MAHOUT-56
> >
> > [Mahout-71] http://issues.apache.org/jira/browse/MAHOUT-71
> >
> > [PatternRecognition] S. Meshoul, A. Deneche, M. Batouche, "Combining an
> > Artificial Immune System with a Clustering Method for Effective Pattern
> > Recognition", International Conference on Machine Intelligence ICMI'05,
> > pp. 782-787, Tunis, 2005.
> >
> > [PublicDatasets] http://aws.amazon.com/publicdatasets/
> >
> > [RandomForests] http://cwiki.apache.org/MAHOUT/random-forests.html
> >
> > ***************************************************************************
>
> --
> Ted Dunning, CTO
> DeepDyve
>
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)