[jira] Updated: (MAHOUT-140) In-memory mapreduce Random Forests

Deneche A. Hakim (JIRA) Sun, 12 Jul 2009 02:44:43 -0700

     [ 
https://issues.apache.org/jira/browse/MAHOUT-140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Deneche A. Hakim updated MAHOUT-140:
------------------------------------

    Attachment: mapred_jul12.diff

*Changes*
* The oob error estimation has been rewritten to be much faster
* BuildForest has an optional argument '-o' to use the optimized IG calculations
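
For readers unfamiliar with the IG (information gain) measure mentioned above, here is a minimal sketch of the standard entropy-based calculation for a split on a numeric attribute. It is not the code from the patch and does not reproduce the '-o' optimization; the function names (entropy, information_gain) and the sample data are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(values, labels, threshold):
    """Gain from splitting a numeric attribute at `threshold`.

    Rows with value < threshold go left, all others go right.
    """
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    total = len(labels)
    weighted = (len(left) / total) * entropy(left) \
        + (len(right) / total) * entropy(right)
    return entropy(labels) - weighted


# Hypothetical toy data: the split at 3.0 separates the classes perfectly.
values = [1.0, 2.0, 3.0, 4.0]
labels = ["a", "a", "b", "b"]
perfect_gain = information_gain(values, labels, 3.0)
partial_gain = information_gain(values, labels, 2.0)
print(perfect_gain, partial_gain)
```

A naive tree builder evaluates this for every candidate threshold of every selected attribute, which is why the calculation is a natural target for optimization.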

I tested the implementation on Amazon EC2:
* on a 1 small instance cluster (1 master + 1 slave), building 50 trees with 
KDD10% takes 44m 45s
* on a 10 small instances cluster (1 master + 10 slaves), building 50 trees 
with KDD10% takes 7m 50s
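
As a rough reading of the two timings above, the speedup and parallel efficiency can be worked out as follows. This is only arithmetic on the reported numbers; it assumes the number of slaves is the unit of parallelism and that the two runs are otherwise comparable, which a single measurement per cluster cannot confirm.

```python
def to_seconds(minutes, seconds):
    """Convert a 'Xm Ys' duration to seconds."""
    return minutes * 60 + seconds


single_slave = to_seconds(44, 45)  # 1 master + 1 slave
ten_slaves = to_seconds(7, 50)     # 1 master + 10 slaves

speedup = single_slave / ten_slaves
efficiency = speedup / 10          # assumption: 10 slaves = 10x the workers

print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```

That is a speedup of about 5.7x on ten slaves, or roughly 57% efficiency under the assumption above.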

*What's next*
* Although many improvements are possible, the current InMem implementation does 
a good job. I will start coding the other mapreduce variant, where each mapper 
uses only the subset of the data available to it to grow the trees

> In-memory mapreduce Random Forests
> ----------------------------------
>
>                 Key: MAHOUT-140
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-140
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>            Priority: Minor
>         Attachments: mapred_jul12.diff, mapred_patch.diff
>
>
> Each mapper is responsible for growing a number of trees with a whole copy of 
> the dataset loaded in memory. It uses the reference implementation's code to 
> build each tree and estimate the oob error.
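
To illustrate the oob (out-of-bag) error estimation the description refers to, here is a minimal sketch of the general technique: each tree is trained on a bootstrap sample, and each instance is scored only by the trees that did not see it. This is not Mahout's code. The train_stump learner is a toy one-split stand-in for a real tree builder, and all names and data here are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows):
    """Toy stand-in for a tree learner: one split on a single numeric feature.

    `rows` is a list of (x, label) pairs; returns a predict(x) function.
    """
    best = None
    for threshold in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x < threshold]
        right = [y for x, y in rows if x >= threshold]
        if not left or not right:
            continue
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) \
            + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        # No valid split (all x equal): always predict the majority label.
        majority = Counter(y for _, y in rows).most_common(1)[0][0]
        return lambda x: majority
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x < threshold else right_label


def oob_error(data, num_trees, rng):
    """Out-of-bag error: each row is voted on only by trees that never saw it."""
    votes = [Counter() for _ in data]
    for _ in range(num_trees):
        # Bootstrap sample: draw len(data) indices with replacement.
        indices = [rng.randrange(len(data)) for _ in data]
        in_bag = set(indices)
        tree = train_stump([data[i] for i in indices])
        for i, (x, _) in enumerate(data):
            if i not in in_bag:
                votes[i][tree(x)] += 1
    # Rows that were in every bag have no votes and are skipped.
    scored = [(v.most_common(1)[0][0], data[i][1])
              for i, v in enumerate(votes) if v]
    if not scored:
        return None
    return sum(pred != actual for pred, actual in scored) / len(scored)


# Hypothetical toy dataset: the label depends on whether x is below 10.
data = [(float(i), "low" if i < 10 else "high") for i in range(20)]
error = oob_error(data, num_trees=25, rng=random.Random(42))
print(error)
```

In the in-memory mapreduce setting described above, each mapper would run the tree-growing loop for its own share of the trees over the full dataset, and the per-instance votes would be combined afterwards.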

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
