Re: [gsoc] random forests

2009-03-31 Thread deneche abdelhakim

Here is a draft of my proposal

**
Title/Summary: [Apache Mahout] Implement parallel Random/Regression Forests

Student: AbdelHakim Deneche
Student e-mail: ...

Student Major: PhD in Computer Science
Student Degree: Master in Computer Science
Student Graduation: Spring 2011

Organization: The Apache Software Foundation
Assigned Mentor:


Abstract:

My goal is to add the power of random/regression forests to Mahout. At the end
of this summer, one should be able to build random/regression forests for large,
possibly distributed, datasets, store the forest, and reuse it to classify new
data. In addition, a demo on EC2 is planned.


Detailed Description:

This project is all about random/regression forests. The core component is the
algorithm that builds a tree from a random bootstrap sample of the whole
dataset. I have already written a detailed description on the Mahout wiki
[RandomForests]. Depending on the size of the dataset, two distributed
implementations are possible:

1. The most straightforward one deals with relatively small datasets. By small,
I mean a dataset that can be replicated on every node of the cluster.
Basically, each mapper has access to the whole dataset, so if the forest
contains N trees and we have M mappers, each mapper runs the core building
algorithm N/M times. This implementation is relatively easy because each
mapper runs the basic building algorithm as it is. It is also of great
interest if the user wants to try different parameters when building the
forest. An out-of-core implementation is also possible to deal with datasets
that cannot fit into a node's memory. (A sketch of such a mapper follows the
second item below.)

2. The second implementation, which is the more difficult of the two, is
concerned with very large datasets that cannot fit on every machine of the
cluster. In this case the mappers work differently: each mapper has access
only to a subset of the dataset, so all the mappers collaborate to build each
tree of the forest. The core building algorithm must therefore be rewritten in
a map-reduce form. This implementation can deal with datasets of any size, as
long as they fit on the cluster.
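
To make the first option concrete, here is the sketch mentioned above: a
minimal, hypothetical illustration in Java (Mahout's language), not existing
Mahout code. It assumes the job input is simply a list of tree ids (one per
line), so Hadoop spreads the N trees over the M mappers, and Dataset, Data,
Tree and TreeBuilder are hypothetical helper classes that the reference
implementation would provide.

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of implementation 1: every mapper sees the whole (replicated) dataset
// and builds its share of the N trees by running the basic algorithm unchanged.
public class TreeBuildingMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

  private Dataset dataset;  // hypothetical helper: the full dataset, loaded locally
  private long baseSeed;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // the dataset was shipped to every node (e.g. via the DistributedCache)
    dataset = Dataset.load(context.getConfiguration());
    baseSeed = context.getConfiguration().getLong("rf.seed", 1L);
  }

  // one input line per tree to build, e.g. "17"
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    long treeId = Long.parseLong(line.toString().trim());
    Random rng = new Random(baseSeed + treeId);          // de-correlate the trees
    Data bootstrap = dataset.bootstrap(rng);             // random sample with replacement
    Tree tree = new TreeBuilder().build(bootstrap, rng); // basic algorithm, as is
    context.write(new LongWritable(treeId), new Text(tree.serialize()));
  }
}

A reducer would then only need to collect the serialized trees into the forest
file; out-of-bag error handling is left out of the sketch.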

Although the first implementation is easier to build, the CPU and I/O overhead
of its out-of-core variant is still unknown. A reference, non-parallel
implementation should thus be written to better understand the effects of the
out-of-core approach, especially for large datasets. This reference
implementation is also useful to assess the correctness of the distributed
implementations.
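
For the reference implementation itself, the shape could be as simple as the
following loop; again, this is only a hypothetical sketch built on the same
helper classes as above, not a finished design.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical outline of the reference, non-parallel forest builder.
public class SequentialForestBuilder {

  public List<Tree> build(Dataset dataset, int numTrees, long seed) {
    Random rng = new Random(seed);
    List<Tree> forest = new ArrayList<Tree>(numTrees);
    for (int t = 0; t < numTrees; t++) {
      Data bootstrap = dataset.bootstrap(rng);             // sample with replacement
      forest.add(new TreeBuilder().build(bootstrap, rng)); // basic algorithm from the wiki
    }
    return forest; // then serialized to a file so it can be reused for classification
  }
}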


Working Plan and List of Deliverables:

Must-Have:
1. Reference implementation of the Random/Regression Forest building algorithm:
 . Build a forest of trees: the basic algorithm (described in the wiki) takes a
bootstrap sample of the dataset as a training set and builds a decision tree;
it is repeated for each tree of the forest.
 . The forest is stored in a file so that it can be reused, at any time, to
classify new cases.
 . At this step, the necessary changes to Mahout's Classifier interface are
made to extend its use beyond Text datasets.

2. Study the effects of large datasets on the reference implementation
 . This step should guide our choice of the proper parallel implementation

3. Parallel implementation (choose one of the following):
 3a. Parallel implementation A:
  . When the dataset can be replicated to all computing nodes.
  . Each mapper has access to the whole dataset; if the forest contains N trees
and we have M mappers, each mapper runs the basic building algorithm N/M times.
The mapper is also responsible for computing the out-of-bag (oob) error
estimation.
  . The reducer stores the trees in the RF file and merges the oob error
estimations.
 3b. Parallel implementation B:
  . When the dataset is so big that it can no longer fit on every computing
node, it must be distributed over the cluster.
  . Each mapper has access to a subset of the dataset, so all the mappers
collaborate to build each tree of the forest (see the sketch after this list).
  . In this case, the basic algorithm must be rewritten to fit the map-reduce
paradigm.
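
To illustrate what rewriting the basic algorithm in map-reduce form could mean
for 3b (the sketch referred to above), here is one possible shape of a single
pass that evaluates candidate splits for the current frontier of unfinished
tree nodes. The Frontier and Instance classes, the key layout, and the
surrounding driver loop are hypothetical design choices, not a settled design.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch for 3b: each mapper scans only its own subset of the data
// and emits partial statistics for the candidate splits of every frontier node.
// A driver would run such a pass repeatedly until all nodes become leaves.
public class SplitStatsMapper extends Mapper<LongWritable, Text, Text, Text> {

  private Frontier frontier; // hypothetical: the open nodes of the partially built trees

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    frontier = Frontier.load(context.getConfiguration());
  }

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    Instance instance = Instance.parse(record.toString()); // hypothetical parser
    String nodeId = frontier.nodeOf(instance);   // route it down the partial trees
    if (nodeId == null) {
      return;                                    // the instance already reached a leaf
    }
    for (String split : frontier.candidateSplits(nodeId)) {   // the m random variables
      // key = (node, candidate split); value = (branch taken, class label)
      context.write(new Text(nodeId + "|" + split),
                    new Text(instance.branchOf(split) + "|" + instance.label()));
    }
  }
}

The reducer would then sum the (branch, label) counts per (node, split), compute
the information gain of each candidate, and output the chosen split for every
frontier node.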

Should-Have:
4. Run the Random Forest on a real dataset on EC2:
 . This step is important because running the RF on a local dual-core machine
is different from running it on a real cluster with a real dataset.
 . This could make a good demo for Mahout.
 . Amazon hosts some interesting public datasets to play with [PublicDatasets].
   The US Census dataset comes in various sizes ranging from 2 GB to 200 GB and
should make a very good example.
 . At this stage it may be useful to implement [MAHOUT-71] (Dataset to Matrix
Reader).

Wanna-Have:
5. If there is still time, implement one or two other important features of
RFs, such as variable importance and proximity estimation.
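
For reference, variable importance is usually computed the way Breiman
describes it: permute one variable at a time in each tree's out-of-bag samples
and measure how much the oob error degrades. A rough sketch, reusing the same
hypothetical Tree/Data helpers as the earlier sketches:

// Hypothetical sketch of permutation-based variable importance.
// 'forest' (List<Tree>), 'numVariables' and 'rng' are assumed to be in scope.
double[] importance = new double[numVariables];
for (Tree tree : forest) {
  Data oob = tree.outOfBagSamples();                    // cases not in this tree's bootstrap
  double baseline = tree.errorOn(oob);
  for (int v = 0; v < numVariables; v++) {
    Data permuted = oob.withVariablePermuted(v, rng);   // shuffle variable v's values
    importance[v] += tree.errorOn(permuted) - baseline; // error increase = importance
  }
}
for (int v = 0; v < numVariables; v++) {
  importance[v] /= forest.size();                       // average over all trees
}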


Additional Information:
I am a PhD student at the University Mentouri of Constantine. My primary
research goal is a framework to help build Intelligent Adaptive Systems. For
my Master's, I worked on

Re: [gsoc] random forests

2009-03-31 Thread Ted Dunning
Deneche,

I don't see your application on the GSOC web site.  Nor on the apache wiki.

Time is running out and I would hate to not see you in the program.  Is it
just that I can't see the application yet?

On Tue, Mar 31, 2009 at 1:05 PM, deneche abdelhakim a_dene...@yahoo.fr wrote:


 Here is a draft of my proposal


Re: [gsoc] random forests

2009-03-30 Thread deneche abdelhakim

Thank you for your answer; it made me aware of several potential future
problems with my implementation.

 The first is that for any given application, the odds that
 the data will not fit in a single machine are small, especially if you 
 have an out-of-core tree builder.  Really, really big datasets are
 increasingly common, but are still a small minority of all datasets.

By out-of-core, do you mean that the builder can fetch the data directly from
a file instead of working only in memory?

 One question I have about your plan is whether your step (1) involves
 building trees or forests only from data held in memory or whether it 
 can be adapted to stream through the data (possibly several
 times).  If a streaming implementation is viable, then it may well be 
 that performance is still quite good for small datasets due to buffering.

I was planning to distribute the dataset files to all workers using Hadoop's
DistributedCache. I think a streaming implementation is feasible: the basic
tree building algorithm (described here
http://cwiki.apache.org/MAHOUT/random-forests.html) would have to stream
through the data (either in memory or from a file) for each node of the tree.
During this pass, it computes the information gain (IG) for the selected
variables.
This algorithm could be improved to compute the IGs for a whole list of nodes
at once, thus reducing the total number of passes through the data. When
building the forest, the list of nodes comes from all the trees built by the
mapper.
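
A sketch of what that improvement could look like (hypothetical classes, not
the wiki algorithm verbatim): accumulate the split statistics for a whole
batch of open nodes during a single pass over the data, then compute the IGs
from the accumulated counts. The maxNodesPerPass limit is the trade-off I come
back to below.

// Hypothetical sketch: one streaming pass computes the IGs for a batch of open nodes.
// 'frontier', 'dataset' and 'maxNodesPerPass' are assumed to be in scope.
List<OpenNode> batch = frontier.take(maxNodesPerPass); // bigger batch = fewer passes, more memory
DataStream stream = dataset.open();                    // in-memory or straight from a file
for (Instance instance = stream.next(); instance != null; instance = stream.next()) {
  OpenNode node = route(instance, batch);              // which open node does this case reach?
  if (node != null) {
    node.accumulate(instance);                         // update counts for its m selected variables
  }
}
stream.close();
for (OpenNode node : batch) {
  node.computeIGsAndSplit();                           // IGs from the counts, then pick the best split
}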

 Another way to put this is that the key question is how single node
 computation scales with input size.  If the scaling is relatively linear
 with data size, then your approach (3) will work no matter the data size.
 If scaling shows an evil memory size effect, then your approach (2) 
 would be required for large data sets.

I'll have to run some tests before answering this question, but I think the
memory usage of the improved algorithm (described above) will mainly go to
storing the IG computations (variable probabilities, etc.). One way to limit
the memory usage is to limit the number of tree nodes processed at each data
pass. Increasing this limit should reduce the number of data passes but
increase the memory usage, and vice versa.

There is still one case that this approach, even out-of-core, cannot handle:
very large datasets that cannot fit on a node's hard drive and thus must be
distributed across the cluster.

abdelHakim





Re: [gsoc] random forests

2009-03-30 Thread Ted Dunning
Indeed.  And those datasets exist.

It is also plausible that this full data scan approach will fail when you
want the forest building to take less time.

It is also plausible that a full data scan approach fails to improve enough
on a non-parallel implementation. This would happen if a significantly large
fraction of the entire forest could be built on a single node, which in turn
would be the case if the CPU cost of forest building is overshadowed by the
I/O cost of scanning the dataset. That would imply a small limit to the amount
of parallelism that would help.
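
(Roughly: if a fraction s of the total work is the data scan that every node
has to repeat, the best speedup on P nodes is about 1 / (s + (1 - s)/P), which
can never exceed 1/s no matter how many nodes are added.)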

You will know much more about this after you finish the non-parallel
implementation than either of us knows now.

On Mon, Mar 30, 2009 at 7:24 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:

 There is still one case that this approach, even out-of-core, cannot
 handle: very large datasets that cannot fit in the node hard-drive, and thus
 must be distributed across the cluster.




-- 
Ted Dunning, CTO
DeepDyve


Re: [gsoc] random forests

2009-03-30 Thread Ted Dunning
I suggest that we all learn from the experience you are about to have on the
reference implementation.

And, yes, I did mean the reference implementation when I said
non-parallel.  Thanks for clarifying.

On Mon, Mar 30, 2009 at 10:45 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:

 What do you suggest ?

 And just to make sure, by 'non-paralel implementation' you mean the
 reference implementation, right ?




-- 
Ted Dunning, CTO
DeepDyve


Re: [gsoc] random forests

2009-03-29 Thread Ted Dunning
I have two answers for you.

The first is that for any given application, the odds that the data will not
fit in a single machine are small, especially if you have an out-of-core
tree builder.  Really, really big datasets are increasingly common, but are
still a small minority of all datasets.

The second answer is that the odds that SOME mahout application will be too
large for a single node are quite high.

These aren't contradictory.  They just describe the long-tail nature of
problem sizes.

One question I have about your plan is whether your step (1) involves
building trees or forests only from data held in memory or whether it can be
adapted to stream through the data (possibly several times).  If a streaming
implementation is viable, then it may well be that performance is still
quite good for small datasets due to buffering.

If streaming works, then a single node will be able to handle very large
datasets but will just be kind of slow.  As you point out, that can be
remedied trivially.

Another way to put this is that the key question is how single node
computation scales with input size.  If the scaling is relatively linear
with data size, then your approach (3) will work no matter the data size.
If scaling shows an evil memory size effect, then your approach (2) would be
required for large data sets.

On Sat, Mar 28, 2009 at 8:14 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:

 My question is: when Mahout.RF is used in a real application, what are the
 odds that the dataset will be so large that it can't fit on every machine of
 the cluster?

 The answer to this question should help me decide which implementation I'll
 choose.




-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
408-773-0110 ext. 738
858-414-0013 (m)
408-773-0220 (fax)


Re: [gsoc] random forests

2009-03-28 Thread deneche abdelhakim

In 2a, you should read:

. This implementation is, relatively, easy given...

--- On Sat, 3/28/09, deneche abdelhakim a_dene...@yahoo.fr wrote:

 From: deneche abdelhakim a_dene...@yahoo.fr
 Subject: Re: [gsoc] random forests
 To: mahout-dev@lucene.apache.org
 Date: Saturday, March 28, 2009, 16:14
 
 I'm actually writing my working plan, and it looks like
 this:
 
 *
 1. reference implementation of Random/Regression Forests
 Building Algorithm: 
  . Build a forest of trees, the basic algorithm (described
 in the wiki) takes a subset from the dataset as a training
 set and builds a decision tree. This basic algorithm is
 repeated for each tree of the forest. 
  . The forest is stored in a file, this way it can be used
 later to classify new cases
 
 2a. distributed Implementation A: 
  . When the dataset can be replicated to all computing
 nodes.
  . Each mapper has access to the whole dataset, if the
 forest contains N trees and we have M mappers, each mapper
 runs the basic building algorithm N/M times.
  . This implementation is, relatively, given that the
 reference implementation is available, because each mapper
 runs the basic building algorithm as it is.
 
 2b. Distributed Implementation B:
  . When the dataset is so big that it can no longer fit on
 every computing node, it must be distributed over the
 cluster. 
  . Each mapper has access to a subset from the dataset,
 thus all the mappers collaborate to build each tree of the
 forest.
  . In this case, the basic algorithm must be rewritten to
 fit in the map-reduce paradigm.
 
 3. Run the Random Forest with a real dataset on EC2:
  . This step is important, because running the RF on a
 local dual core machine is way different from running it on
 a real cluster with a real dataset.
  . This can make for a good demo for Mahout
 
 4. If there is still time, implement one or two other
 important features of RFs such as Variable importance and
 Proximity estimation
 *
 
 It is clear from the plan that I won't be able to do all
 those steps, and in some way I must choose only one
 implementation (2a or 2b) to do. The first implementation
 should take less time to implement than 2b and I'm quite
 sure I can go up to the 4th step, adding other features to
 the RF. BUT the second implementation is the only one
 capable of dealing with very large distributed datasets.
 
 My question is: when Mahout.RF is used in a real application, what are the
 odds that the dataset will be so large that it can't fit on every machine of
 the cluster?
 
 The answer to this question should help me decide which implementation I'll
 choose.
 
 --- On Sun, 3/22/09, Ted Dunning ted.dunn...@gmail.com wrote:
 
  From: Ted Dunning ted.dunn...@gmail.com
  Subject: Re: [gsoc] random forests
  To: mahout-dev@lucene.apache.org
  Date: Sunday, March 22, 2009, 0:36
  Great expression!
  
  You may be right about the nose-bleed tendency between
 the
  two methods.
  
  On Sat, Mar 21, 2009 at 4:46 AM, deneche abdelhakim
 a_dene...@yahoo.fr wrote:
  
   I can't find a no-nose-bleeding algorithm
  
  
  
  
  -- 
  Ted Dunning, CTO
  DeepDyve
  
 
 
 
 





Re: [gsoc] random forests

2009-03-17 Thread deneche abdelhakim

Yeah, Breiman states that

at each node, m variables are selected at random out of the M

I modified the wiki page: in LearnUnprunedTree(X,Y), which iteratively builds
one node at a time, I added this line:

select m variables at random out of the M variables

before searching for the best split:

For j = 1 .. m
  ...
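
In Java-ish terms (a hypothetical sketch of just that modification, not the
wiki pseudocode), it amounts to:

// At each node: select m variables at random out of the M variables,
// then search for the best split among those m only.
// 'M', 'm' and 'rng' are assumed to be in scope.
List<Integer> variables = new ArrayList<Integer>(M);
for (int i = 0; i < M; i++) {
  variables.add(i);
}
Collections.shuffle(variables, rng);
for (int j : variables.subList(0, m)) {
  // ... evaluate the candidate splits on variable j, keeping the best one ...
}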

--- On Mon, 3/16/09, Ted Dunning ted.dunn...@gmail.com wrote:

 From: Ted Dunning ted.dunn...@gmail.com
 Subject: Re: [gsoc] random forests
 To: mahout-dev@lucene.apache.org
 Date: Monday, March 16, 2009, 7:26
 Nice writeup.
 
 One thing that I was confused about for a long time is
 whether the choice of
 variables to use for splits is chosen once per tree or
 again at each split.
 
 I think that the latter interpretation is actually the
 correct one.  You
 should check my thought.
 
 On Sun, Mar 15, 2009 at 1:53 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:
 
  I added a page to the wiki that describes how to build
 a random forest and
  how to use it to classify new cases.
 
 
 
 
 -- 
 Ted Dunning, CTO
 DeepDyve