I'm currently writing my working plan, and it looks like this:

*****************************************************************
1. Reference implementation of the Random/Regression Forest building algorithm:
 . Build a forest of trees: the basic algorithm (described in the wiki) takes a 
subset of the dataset as a training set and builds a decision tree. This basic 
algorithm is repeated for each tree of the forest (a rough sketch of this loop 
follows the plan).
 . The forest is stored in a file so that it can be used later to classify new 
cases.

2a. Distributed Implementation A:
 . Applies when the dataset can be replicated to all computing nodes.
 . Each mapper has access to the whole dataset; if the forest contains N trees 
and we have M mappers, each mapper runs the basic building algorithm N/M times 
(see the partitioning sketch further below).
 . This implementation is relatively easy, given that the reference 
implementation is available, because each mapper runs the basic building 
algorithm "as it is".

2b. Distributed Implementation B:
 . Applies when the dataset is so big that it can no longer fit on every 
computing node and must be distributed over the cluster.
 . Each mapper has access to only a subset of the dataset, so all the mappers 
must collaborate to build each tree of the forest.
 . In this case, the basic algorithm must be rewritten to fit the map-reduce 
paradigm (a rough idea of how this could work is sketched further below).

3. Run the Random Forest with a real dataset on EC2:
 . This step is important because running the RF on a local dual-core machine 
is very different from running it on a real cluster with a real dataset.
 . This could make for a good demo for Mahout.

4. If there is still time, implement one or two other important RF features, 
such as variable importance and proximity estimation.
*****************************************************************
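
To make step 1 concrete, here is a very rough sketch of the building loop I 
have in mind. It is only a sketch under assumptions: Data/Instance handling is 
left out, and the DecisionTree type and the buildDecisionTree()/bootstrapSample() 
helpers are hypothetical placeholders, not existing Mahout classes.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReferenceForestBuilder {

  /** Builds nbTrees decision trees, each from a bootstrap sample of the data. */
  static <T> List<DecisionTree> buildForest(List<T> data, int nbTrees, Random rng) {
    List<DecisionTree> forest = new ArrayList<DecisionTree>(nbTrees);
    for (int t = 0; t < nbTrees; t++) {
      List<T> sample = bootstrapSample(data, rng); // sample with replacement, same size as data
      forest.add(buildDecisionTree(sample, rng));  // hypothetical: the basic tree-growing algorithm
    }
    return forest;
  }

  /** Draws a sample of data.size() instances, with replacement. */
  static <T> List<T> bootstrapSample(List<T> data, Random rng) {
    List<T> sample = new ArrayList<T>(data.size());
    for (int i = 0; i < data.size(); i++) {
      sample.add(data.get(rng.nextInt(data.size())));
    }
    return sample;
  }

  /** Stores the forest in a file so it can be reloaded later to classify new cases. */
  static void storeForest(List<DecisionTree> forest, String path) throws IOException {
    ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path));
    try {
      out.writeObject(new ArrayList<DecisionTree>(forest));
    } finally {
      out.close();
    }
  }

  // Placeholder: in the real implementation this is the basic algorithm from the wiki.
  static <T> DecisionTree buildDecisionTree(List<T> sample, Random rng) {
    throw new UnsupportedOperationException("tree growing not sketched here");
  }

  /** Placeholder type for a grown tree. */
  interface DecisionTree extends Serializable {}
}

Plain Java serialization is just a stand-in here; the actual storage format for 
the forest can be decided later.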

It is clear from the plan that I won't be able to do all of those steps, and I 
will have to choose only one of the two implementations (2a or 2b). The first 
implementation (2a) should take less time to implement than 2b, and I'm quite 
sure I could get all the way to step 4, adding other features to the RF. BUT 
the second implementation (2b) is the only one capable of dealing with very 
large distributed datasets.
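
To show why I consider 2a the easy one: with the dataset replicated everywhere, 
the only real work is deciding how many trees each mapper builds when N is not 
a multiple of M. A tiny sketch of that partitioning (all names are made up, 
nothing here is existing Mahout code):

public class TreePartitioner {

  /** Number of trees the given mapper (0-based partition index) should build out of nbTrees total. */
  static int treesForMapper(int nbTrees, int nbMappers, int partition) {
    int base = nbTrees / nbMappers;
    // the first (nbTrees % nbMappers) mappers build one extra tree
    return partition < nbTrees % nbMappers ? base + 1 : base;
  }

  public static void main(String[] args) {
    int nbTrees = 10;
    int nbMappers = 4;
    int total = 0;
    for (int m = 0; m < nbMappers; m++) {
      int n = treesForMapper(nbTrees, nbMappers, m);
      total += n;
      System.out.println("mapper " + m + " builds " + n + " trees");
    }
    System.out.println("total = " + total); // prints 10
  }
}

Each mapper would also need its own random seed, otherwise all the mappers 
would grow identical trees.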

My question is: when Mahout.RF is used in a real application, what are the 
odds that the dataset will be so large that it can't fit on every machine of 
the cluster?

The answer to this question should help me decide which implementation to 
choose.
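
For completeness, here is how I currently picture the rewrite needed for 2b 
(this is just one way of doing it, not a settled design): since no mapper sees 
the whole dataset, each mapper computes partial class counts for the candidate 
splits over its own data partition, and because those counts are associative a 
reducer can sum them and then pick the best split globally. A sketch of that 
merge step, again with made-up names and nothing Mahout-specific:

public class PartialSplitStats {

  // counts[candidate][class] = number of instances of that class going to the
  // left (resp. right) side of the candidate split, over one data partition only
  final long[][] leftCounts;
  final long[][] rightCounts;

  PartialSplitStats(int nbCandidates, int nbClasses) {
    leftCounts = new long[nbCandidates][nbClasses];
    rightCounts = new long[nbCandidates][nbClasses];
  }

  /** Sums another partition's partial counts into this one (the "reduce" step). */
  void merge(PartialSplitStats other) {
    for (int c = 0; c < leftCounts.length; c++) {
      for (int k = 0; k < leftCounts[c].length; k++) {
        leftCounts[c][k] += other.leftCounts[c][k];
        rightCounts[c][k] += other.rightCounts[c][k];
      }
    }
  }

  /** After all partitions are merged, picks the candidate with the lowest weighted Gini impurity. */
  int bestCandidate() {
    int best = 0;
    double bestScore = Double.MAX_VALUE;
    for (int c = 0; c < leftCounts.length; c++) {
      double score = weightedGini(leftCounts[c], rightCounts[c]);
      if (score < bestScore) {
        bestScore = score;
        best = c;
      }
    }
    return best;
  }

  private static double weightedGini(long[] left, long[] right) {
    long nl = sum(left), nr = sum(right), n = nl + nr;
    if (n == 0) return Double.MAX_VALUE;
    return (nl * gini(left, nl) + nr * gini(right, nr)) / n;
  }

  private static double gini(long[] counts, long total) {
    if (total == 0) return 0.0;
    double g = 1.0;
    for (long c : counts) {
      double p = (double) c / total;
      g -= p * p;
    }
    return g;
  }

  private static long sum(long[] a) {
    long s = 0;
    for (long x : a) s += x;
    return s;
  }
}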

--- On Sun 22.3.09, Ted Dunning <ted.dunn...@gmail.com> wrote:

> From: Ted Dunning <ted.dunn...@gmail.com>
> Subject: Re: [gsoc] random forests
> To: mahout-dev@lucene.apache.org
> Date: Sunday, 22 March 2009, 00:36
> Great expression!
> 
> You may be right about the nose-bleed tendency between the
> two methods.
> 
> On Sat, Mar 21, 2009 at 4:46 AM, deneche abdelhakim <a_dene...@yahoo.fr>wrote:
> 
> > I can't find a no-nose-bleeding algorithm
> 
> 
> 
> 
> -- 
> Ted Dunning, CTO
> DeepDyve
> 
