Here is a draft of my proposal:

**************************************************
Title/Summary: [Apache Mahout] Implement parallel Random/Regression Forests

Student: AbdelHakim Deneche
Student e-mail: ...

Student Major: PhD in Computer Science
Student Degree: Master's in Computer Science
Student Graduation: Spring 2011

Organization: The Apache Software Foundation
Assigned Mentor:


Abstract:

My goal is to add the power of random/regression forests to Mahout. By the end 
of this summer, one should be able to build random/regression forests for 
large, possibly distributed, datasets, store the forest, and reuse it to 
classify new data. In addition, a demo on EC2 is planned.


Detailed Description:

This project is all about random/regression forests. The core component is the 
algorithm that builds a single tree from a random bootstrap sample of the 
dataset; I have already written a detailed description of it on the Mahout 
wiki [RandomForests]. Depending on the size of the dataset, two distributed 
implementations are possible:

1. The most straightforward one deals with relatively small datasets. By small, 
I mean a dataset that can be replicated on every node of the cluster. 
Basically, each mapper has access to the whole dataset, so if the forest 
contains N trees and we have M mappers, each mapper runs the core building 
algorithm N/M times (see the sketch after this list). This implementation is 
relatively easy because each mapper runs the basic building algorithm as-is. 
It is also of great interest if the user wants to try different parameters 
when building the forest. An out-of-core variant is also possible, to deal 
with datasets that do not fit in a node's memory.

2. The second implementation, which is the more difficult one, is concerned 
with very large datasets that cannot fit on every machine of the cluster. In 
this case the mappers work differently: each mapper has access to only a 
subset of the dataset, so all the mappers collaborate to build each tree of 
the forest. The core building algorithm must therefore be rewritten in a 
map-reduce form. This implementation can deal with datasets of any size, as 
long as they fit on the cluster.
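
To make the first scheme concrete, below is a minimal sketch in plain Java of 
how the N trees could be divided among the M mappers. None of this is existing 
Mahout code: TreeBuilder stands in for the core building algorithm, and the 
Hadoop plumbing around it is omitted. Seeding the random generator with the 
tree index keeps every tree reproducible no matter which mapper grows it.

import java.util.Random;

public class TreePartition {

  /** Placeholder for the core building algorithm described on the wiki. */
  interface TreeBuilder {
    Object build(Random rng);  // grows one tree from a bootstrap sample
  }

  /**
   * Trees are dealt round-robin: mapper m grows every tree whose index i
   * satisfies i % numMappers == m, i.e. about numTrees / numMappers trees.
   */
  static Object[] mapperTrees(int numTrees, int numMappers, int mapperId,
                              long seed, TreeBuilder builder) {
    int count = (numTrees - mapperId + numMappers - 1) / numMappers;
    Object[] trees = new Object[count];
    int k = 0;
    for (int i = mapperId; i < numTrees; i += numMappers) {
      // per-tree seed: the forest is the same however trees are assigned
      trees[k++] = builder.build(new Random(seed ^ i));
    }
    return trees;
  }
}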

Although the first approach is easier to implement, the CPU and I/O overhead 
of its out-of-core variant is still unknown. A reference, non-parallel 
implementation should therefore be built first, to better understand the 
behavior of the out-of-core variant, especially on large datasets. This 
reference implementation is also useful for assessing the correctness of the 
distributed implementations.
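
As a rough illustration of what the reference implementation could look like, 
here is a self-contained Java skeleton. To stay short it grows one-level 
stumps on a randomly chosen feature instead of full trees; the real 
implementation would grow complete trees with random feature selection at 
every node, as described on the wiki. All class and method names here are 
mine, not existing Mahout ones.

import java.util.Random;

public class ReferenceForest {

  /** One-level decision stump: splits one feature at one threshold. */
  static final class Stump {
    int feature;
    double threshold;
    int leftLabel;   // majority class when x[feature] <= threshold
    int rightLabel;  // majority class otherwise

    int classify(double[] x) {
      return x[feature] <= threshold ? leftLabel : rightLabel;
    }
  }

  private final Stump[] trees;
  private final int numLabels;

  ReferenceForest(double[][] data, int[] labels, int numLabels,
                  int numTrees, long seed) {
    this.numLabels = numLabels;
    this.trees = new Stump[numTrees];
    Random rng = new Random(seed);
    int n = data.length;
    for (int t = 0; t < numTrees; t++) {
      // bootstrap: draw n instances with replacement
      int[] sample = new int[n];
      for (int i = 0; i < n; i++) {
        sample[i] = rng.nextInt(n);
      }
      trees[t] = trainStump(data, labels, sample, rng);
    }
  }

  private Stump trainStump(double[][] data, int[] labels, int[] sample,
                           Random rng) {
    Stump s = new Stump();
    s.feature = rng.nextInt(data[0].length);      // random feature
    double sum = 0;
    for (int i : sample) {
      sum += data[i][s.feature];
    }
    s.threshold = sum / sample.length;            // split at the mean
    int[] left = new int[numLabels];              // class counts per side
    int[] right = new int[numLabels];
    for (int i : sample) {
      if (data[i][s.feature] <= s.threshold) {
        left[labels[i]]++;
      } else {
        right[labels[i]]++;
      }
    }
    s.leftLabel = argMax(left);
    s.rightLabel = argMax(right);
    return s;
  }

  /** Majority vote over all trees. */
  int classify(double[] x) {
    int[] votes = new int[numLabels];
    for (Stump t : trees) {
      votes[t.classify(x)]++;
    }
    return argMax(votes);
  }

  private static int argMax(int[] a) {
    int best = 0;
    for (int i = 1; i < a.length; i++) {
      if (a[i] > a[best]) {
        best = i;
      }
    }
    return best;
  }
}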


Working Plan and List of Deliverables:

Must-Have:
1. Reference implementation of the Random/Regression Forest building algorithm:
 . Build a forest of trees: the basic algorithm (described in the wiki) takes 
a bootstrap sample of the dataset as a training set and builds a decision 
tree. This algorithm is repeated for each tree of the forest.
 . The forest is stored in a file, so it can be reused, at any time, to 
classify new cases.
 . At this step, the necessary changes to Mahout's Classifier interface are 
made to extend its use to more than Text datasets.

2. Study the effects of large datasets on the reference implementation
 . This step should guide our choice of the proper parallel implementation

3. Parallel implementation; choose one of the following, guided by step 2:
 3a. Parallel implementation A:
  . When the dataset can be replicated to all computing nodes.
  . Each mapper has access to the whole dataset; if the forest contains N 
trees and we have M mappers, each mapper runs the basic building algorithm N/M 
times. Each mapper is also responsible for computing the out-of-bag (oob) 
error estimate (see the sketch after this list).
  . The reducer stores the trees in the forest file and merges the oob error 
estimates.
 3b. Parallel implementation B:
  . When the dataset is so big that it can no longer fit on every computing 
node, it must be distributed over the cluster.
  . Each mapper has access to a subset of the dataset, so all the mappers 
collaborate to build each tree of the forest.
  . In this case, the basic algorithm must be rewritten to fit the map-reduce 
paradigm.
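
A word on the out-of-bag estimate mentioned in 3a: each training instance is 
classified only by the trees that did not draw it into their bootstrap sample, 
which gives an almost-free substitute for cross-validation. Below is a hedged 
sketch of the final computation (the names and the pre-computed matrices are 
illustrative, not Mahout code); in the map-reduce setting each mapper can 
produce a (wrong, counted) pair for its share and the reducer simply sums the 
pairs before dividing.

public class OobEstimator {

  /**
   * inBag[t][i]     : was instance i drawn into tree t's bootstrap sample?
   * treeVotes[t][i] : label that tree t predicts for instance i
   * Returns the fraction of instances misclassified by the majority vote
   * of their out-of-bag trees.
   */
  static double oobError(boolean[][] inBag, int[][] treeVotes,
                         int[] labels, int numLabels) {
    int wrong = 0;
    int counted = 0;
    for (int i = 0; i < labels.length; i++) {
      int[] votes = new int[numLabels];
      boolean hasVote = false;
      for (int t = 0; t < inBag.length; t++) {
        if (!inBag[t][i]) {          // only out-of-bag trees may vote
          votes[treeVotes[t][i]]++;
          hasVote = true;
        }
      }
      if (!hasVote) {
        continue;                    // instance was in every bag: skip it
      }
      counted++;
      int best = 0;
      for (int c = 1; c < numLabels; c++) {
        if (votes[c] > votes[best]) {
          best = c;
        }
      }
      if (best != labels[i]) {
        wrong++;
      }
    }
    return counted == 0 ? 0.0 : (double) wrong / counted;
  }
}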

Should-Have:
4. Run the Random Forest with a real dataset on EC2:
 . This step is important because running the RF on a local dual-core machine 
is different from running it on a real cluster with a real dataset.
 . This can make a good demo for Mahout.
 . Amazon has made some interesting datasets publicly available 
[PublicDatasets]. The US Census dataset comes in various sizes, ranging from 
2 GB to 200 GB, and should make a very good example.
 . At this stage it may be useful to implement [MAHOUT-71] (Dataset to Matrix 
Reader).

Wanna-Have:
5. If there is still time, implement one or two other important features of 
RFs, such as variable importance and proximity estimation.
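
For reference, variable importance is commonly estimated with Breiman's 
permutation trick: permute one feature among a tree's out-of-bag cases and 
measure how much the accuracy drops. A minimal sketch under that assumption 
follows (the Classifier interface is a placeholder, not Mahout code); the 
real measure averages this quantity over all trees of the forest.

import java.util.Random;

public class VariableImportance {

  /** Placeholder for one trained tree of the forest. */
  interface Classifier {
    int classify(double[] x);
  }

  /**
   * Drop in out-of-bag accuracy for one tree after permuting feature j
   * among its oob cases (indices in oob). The larger the drop, the more
   * the tree relied on feature j.
   */
  static double importance(Classifier tree, double[][] data, int[] labels,
                           int[] oob, int j, Random rng) {
    if (oob.length == 0) {
      return 0.0;
    }
    int baseline = 0;
    for (int i : oob) {
      if (tree.classify(data[i]) == labels[i]) {
        baseline++;
      }
    }
    // Fisher-Yates shuffle of feature j's values among the oob cases
    double[] column = new double[oob.length];
    for (int k = 0; k < oob.length; k++) {
      column[k] = data[oob[k]][j];
    }
    for (int k = column.length - 1; k > 0; k--) {
      int r = rng.nextInt(k + 1);
      double tmp = column[k]; column[k] = column[r]; column[r] = tmp;
    }
    int permuted = 0;
    for (int k = 0; k < oob.length; k++) {
      double[] row = data[oob[k]].clone();  // do not touch the dataset
      row[j] = column[k];
      if (tree.classify(row) == labels[oob[k]]) {
        permuted++;
      }
    }
    return (baseline - permuted) / (double) oob.length;
  }
}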


Additional Information:
I am a PhD student at the University Mentouri of Constantine. My primary 
research goal is a framework to help build Intelligent Adaptive Systems. For 
my Master's, I worked on Artificial Immune Systems, applying them to 
handwritten digit recognition [PatternRecognition] and Multiple Sequence 
Alignment (bioinformatics) [BioInformatics].
Last year I participated in GSoC as an Apache student; I integrated an 
Evolutionary Computing framework into Mahout [MAHOUT-56].

References

[BioInformatics] A. Layeb, A. Deneche, "Multiple Sequence Alignment by Immune 
Artificial System", ACS/IEEE International Conference on Computer Systems and 
Applications (AICCSA’07), Jordan, 2007.

[MAHOUT-56] https://issues.apache.org/jira/browse/MAHOUT-56

[MAHOUT-71] http://issues.apache.org/jira/browse/MAHOUT-71

[PatternRecognition] S. Meshoul, A. Deneche, M. Batouche, "Combining an 
Artificial Immune System with a Clustering Method for Effective Pattern 
Recognition", International Conference on Machine Intelligence (ICMI’05), 
pp. 782-787, Tunis, 2005.

[PublicDatasets] http://aws.amazon.com/publicdatasets/

[RandomForests] http://cwiki.apache.org/MAHOUT/random-forests.html

*************************************************************************** 


  
