Here is a draft of my proposal ************************************************** Title/Summary: [Apache Mahout] Implement parallel Random/Regression Forests
Student: AbdelHakim Deneche Student e-mail: ... Student Major: Phd in Computer Science Student Degree: Master in Computer Science Student Graduation: Spring 2011 Organization: The Apache Software Foundation Assigned Mentor: Abstract: My goal is to add the power of random/regression forests to Mahout. At the end of this summer one should be able to build random/regression forests for large, possibly, distributed datasets, store the forest and reuse it to classify new data. In addition, a demo on EC2 is planned. Detailed Description: This project is all about random/regression forests. The core component is the tree building algorithm from a random bootstrap from the whole dataset. I already wrote a detailed description on Mahout Wiki [RandomForests]. Given the size of the dataset, two distributed implementation are possible: 1. The most straightforward one deals with relatively small datasets. By small, I mean a dataset that can be replicated on every node of the cluster. Basically, each mapper has access to the whole dataset, so if the forest contains N trees and we have M mappers, each mapper runs the core building algorithm N/M times. This implementation is, relatively, easy because each mapper runs the basic building algorithm "as it is". It is also of great interest if the user wants to "try" different parameters when building the forest. An out-of-core implementation is also possible to deal with datasets that cannot fit into the node memory. 2. The second implementation, which is the most difficult, is concerned with very large datasets that cannot fit in every machine of the cluster. In this case the mappers work differently, each mapper has access to a subset from the dataset, thus all the mappers collaborate to build each tree of the forest. The core building algorithm must thus be rewritten in a map-reduce form. This implementation can deal with datasets of any size, as long as they are on the cluster. Although the first implementation is easier to implement, the CPU and IO overhead of the out-of-core implementation are still unknown. A reference, non-parallel, implementation should thus be built to better understand the effects of the out-of-core implementation, especially for large datasets. This reference implementation is also usefull to asses the correctness of the distributed implementation. Working Plan and list of deliverables Must-Have: 1. reference implementation of Random/Regression Forests Building Algorithm: . Build a forest of trees, the basic algorithm (described in the wiki) takes a subset from the dataset as a training set and builds a decision tree. This algorithm is repeated for each tree of the forest. . The forest is stored in a file, this way it can be re-used, at any time, to classify new cases . At this step, the necessary changes to Mahout's Classifier interface are made to extend its use to more than Text datasets. 2. Study the effects of large datasets on the reference implementation . This step should guide our choice of the proper parallel implementation 3. Parallel implementation, choose one of the following: 3a. Parallel implementation A . When the dataset can be replicated to all computing nodes. . Each mapper has access to the whole dataset, if the forest contains N trees and we have M mappers, each mapper runs the basic building algorithm N/M times. The mapper if also responsible of computing the out-of-bag error estimation. . The reducer store the trees in the RF file, and merges the oob error estimations. 3b. Parallel implementation B: . When the dataset is so big that it can no longer fit on every computing node, it must be distributed over the cluster. . Each mapper has access to a subset from the dataset, thus all the mappers collaborate to build each tree of the forest. . In this case, the basic algorithm must be rewritten to fit in the map-reduce paradigm. Should-Have: 4. Run the Random Forest with a real dataset on EC2: . This step is important, because running the RF on a local dual core machine is different from running it on a real cluster with a real dataset. . This can make a good demo for Mahout . Amazon has put some interesting datasets to play with [PublicDatasets]. The US Census dataset comes in various sizes ranging from 2Go to 200Go, and should make a very good example. . At this stage it may be useful to implement [MAHOUT-71] (Dataset to Matrix Reader). Wanna-Have: 5. If there is still time, implement one or two other important features of RFs such as Variable importance and Proximity estimation Additional Information: I am a PhD student at the University Mentouri of Constantine. My primary research goal is a framework to help build Intelligent Adaptive Systems. For the purpose of my Master, I worked on Artificial Immune Systems. I applied them to handwritten digits recognition [PatternRecognition] and Muliple Sequence Alignement (bioinformatics) [BioInformatics]. Last year I participated in the GSoC as an Apache student, I integrated an Evolutionary Computing Framework to Mahout [Mahout-56]. References [BioInformatics] A. Layeb, A. Deneche, "Multiple Sequence Alignment by Immune Artificial System", ACS/IEEE International Conference on Computer Systems and Applications AICCSA’07, Jordan 2007. [Mahout-56] https://issues.apache.org/jira/browse/MAHOUT-56 [Mahout-71] http://issues.apache.org/jira/browse/MAHOUT-71 [PatternRecognition] S. Meshoul, A. Deneche, M. Batouche, "Combining an Artificial Immune System with a Clustering Method for Effective Pattern Recognition", International Conference on Machine Intelligence ICMI’05, pp. 782-787, Tunis 2005. [PublicDatasets] http://aws.amazon.com/publicdatasets/ [RandomForests] http://cwiki.apache.org/MAHOUT/random-forests.html ***************************************************************************