[ 
https://issues.apache.org/jira/browse/MAHOUT-122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deneche A. Hakim updated MAHOUT-122:
------------------------------------

    Attachment: 2w_patch.diff

*second week patch*
work in progress...

*changes:*
* added many tests, although some are still missing

* added a new class "Instance" that allows me to add an ID and a separate LABEL 
to a Vector

* DataLoader.loadData(String, FileSystem, Path) loads the data from a file; 
attributes marked IGNORED are skipped

* Dataset handles only NUMERICAL and CATEGORICAL attributes

  ** contains a List<String> holding the labels as found in the data, 
before they are converted to int

* added a new class "Data" that represents the data being loaded

  ** contains methods to create subsets of the current Data

  ** the only way to get a new Data instance is to load it with DataLoader, or 
to use methods from an existing Data instance

  ** this class could prove useful later to optimize the memory usage of the 
data

* ForestBuilder.buildForest uses a PredictionCallback to collect the oob 
predictions; by changing the callback we can compute different error rates, for 
example:

  ** Forest out-of-bag error estimation

  ** mean tree error rate

  ** ...

* I added a small running example in ForestBuilder.main(); this example shows a 
typical use of Random Forests:

  ** loads the data from a file; you'll need to provide a descriptor (for 
example, UciDescriptors.java contains the descriptors for the "glass" and 
"post-operative" UCI datasets; the datasets are available at the UCI web site)

  ** reserves 10% of the data as a test set (not used for now)

  ** builds a random forest using the remaining data

  ** computes the oob error estimation

  ** this procedure is repeated 100 times and the mean oob error estimation is 
printed
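
To illustrate the PredictionCallback point above, here is a minimal sketch of 
how swapping the callback changes the error being computed. The interface 
shape (OobCallback and its prediction(treeId, instanceId, predictedLabel) 
method) and both class names are assumptions made for this illustration; the 
actual PredictionCallback in the patch may look different.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical callback shape; the PredictionCallback in the patch may differ. */
interface OobCallback {
  void prediction(int treeId, int instanceId, int predictedLabel);
}

/** Forest oob error: majority vote over the trees for which an instance was out of bag. */
class ForestOobError implements OobCallback {
  private final int[] labels;  // true label of each instance
  private final int[][] votes; // votes[instance][label]

  ForestOobError(int[] labels, int nbLabels) {
    this.labels = labels;
    this.votes = new int[labels.length][nbLabels];
  }

  @Override
  public void prediction(int treeId, int instanceId, int predictedLabel) {
    votes[instanceId][predictedLabel]++;
  }

  double errorRate() {
    int voted = 0;
    int errors = 0;
    for (int i = 0; i < labels.length; i++) {
      int best = -1;
      int bestVotes = 0;
      // on a tie, the lowest label index wins
      for (int l = 0; l < votes[i].length; l++) {
        if (votes[i][l] > bestVotes) {
          bestVotes = votes[i][l];
          best = l;
        }
      }
      if (best < 0) continue; // instance was never out of bag
      voted++;
      if (best != labels[i]) errors++;
    }
    return voted == 0 ? 0.0 : (double) errors / voted;
  }
}

/** Mean tree error rate: average of each tree's own oob error. */
class MeanTreeError implements OobCallback {
  private final int[] labels;
  private final Map<Integer, int[]> perTree = new HashMap<>(); // treeId -> {errors, total}

  MeanTreeError(int[] labels) {
    this.labels = labels;
  }

  @Override
  public void prediction(int treeId, int instanceId, int predictedLabel) {
    int[] counts = perTree.computeIfAbsent(treeId, id -> new int[2]);
    if (predictedLabel != labels[instanceId]) counts[0]++;
    counts[1]++;
  }

  double errorRate() {
    if (perTree.isEmpty()) return 0.0;
    double sum = 0.0;
    for (int[] counts : perTree.values()) sum += (double) counts[0] / counts[1];
    return sum / perTree.size();
  }
}
```

Both classes receive exactly the same stream of oob predictions; only the 
aggregation differs, which is why the builder itself would not need to change.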

If you want to try the example, you'll need to download the "post-operative" 
dataset or the "glass" dataset from UCI, put it somewhere, change the first 
line of ForestBuilder.main() to the correct path, and use the corresponding 
UciDescriptor in the third line.
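
The procedure of the example (hold out 10%, build on the rest, repeat and 
average) can be sketched roughly as follows. This is not the code of 
ForestBuilder.main(); HoldoutRuns, split() and meanOobError() are hypothetical 
names, and the forest building itself is passed in as a function so that the 
sketch stays self-contained.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.ToDoubleFunction;

/** Hypothetical helper, not part of the patch. */
final class HoldoutRuns {

  private HoldoutRuns() {}

  /** Shuffles the indices 0..n-1 and returns {train, test}. */
  static int[][] split(int n, double testRatio, Random rng) {
    int[] idx = new int[n];
    for (int i = 0; i < n; i++) idx[i] = i;
    // Fisher-Yates shuffle
    for (int i = n - 1; i > 0; i--) {
      int j = rng.nextInt(i + 1);
      int tmp = idx[i];
      idx[i] = idx[j];
      idx[j] = tmp;
    }
    int nbTest = (int) Math.round(n * testRatio);
    int[] test = Arrays.copyOfRange(idx, 0, nbTest);
    int[] train = Arrays.copyOfRange(idx, nbTest, n);
    return new int[][] {train, test};
  }

  /**
   * Repeats "reserve 10%, build on the rest" the given number of times and
   * returns the mean of the values produced by buildAndEstimate, which stands
   * in for building a forest on the training indices and returning its oob
   * error estimation.
   */
  static double meanOobError(int n, int runs, Random rng,
                             ToDoubleFunction<int[]> buildAndEstimate) {
    double sum = 0.0;
    for (int r = 0; r < runs; r++) {
      int[] train = split(n, 0.1, rng)[0]; // the test part is not used for now
      sum += buildAndEstimate.applyAsDouble(train);
    }
    return sum / runs;
  }
}
```

Passing the Random in explicitly is a design choice of the sketch: a fixed seed 
makes the 100 repetitions reproducible.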

*Note about memory usage:*

* the reference implementation loads the data in-memory, then builds the trees 
one at a time

* each tree is built recursively using DecisionTree.learnUnprunedTree(); at 
each node the data is split and learnUnprunedTree() is called for each subset

* the current implementation of "Data" is not memory efficient: each subset 
keeps its own copy of its part of the data; thus, except where there are Leaf 
nodes, each level of the tree generates one more copy of the data in memory
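
One possible way to avoid the per-level copies described above is to let every 
subset share the same rows and keep only an array of row indices. The sketch 
below is an assumption about how this could be done, not how "Data" works in 
the patch; DataView and all its methods are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical index-based view over shared rows; not the Data class of the patch. */
final class DataView {
  private final double[][] rows; // shared by all views, never copied
  private final int[] indices;   // rows visible through this view

  DataView(double[][] rows) {
    this(rows, range(rows.length));
  }

  private DataView(double[][] rows, int[] indices) {
    this.rows = rows;
    this.indices = indices;
  }

  private static int[] range(int n) {
    int[] r = new int[n];
    for (int i = 0; i < n; i++) r[i] = i;
    return r;
  }

  int size() {
    return indices.length;
  }

  /** Value of attribute attr for the i-th row of this view. */
  double get(int i, int attr) {
    return rows[indices[i]][attr];
  }

  /**
   * Rows whose attribute attr is below threshold (lower == true) or greater
   * than or equal to it (lower == false). Only an int[] is allocated.
   */
  DataView subset(int attr, double threshold, boolean lower) {
    int[] kept = new int[indices.length];
    int n = 0;
    for (int index : indices) {
      boolean below = rows[index][attr] < threshold;
      if (below == lower) kept[n++] = index;
    }
    return new DataView(rows, Arrays.copyOf(kept, n));
  }

  boolean sharesRowsWith(DataView other) {
    return rows == other.rows;
  }
}
```

With such a view, each level of the tree would cost one int per remaining row 
instead of a copy of the attribute values.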

*What's next:*

* a RandomForest class that will contain the result of the forest building and 
can be stored to and loaded from a file

* try the implementation on the same UCI datasets as in Breiman's paper, using 
the same complete procedure

* do some memory usage monitoring

> Random Forests Reference Implementation
> ---------------------------------------
>
>                 Key: MAHOUT-122
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-122
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>         Attachments: 2w_patch.diff, RF reference.patch
>
>   Original Estimate: 25h
>  Remaining Estimate: 25h
>
> This is the first step of my GSOC project. Implement a simple, easy to 
> understand, reference implementation of Random Forests (Building and 
> Classification). The only requirement here is that "it works"

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
